docs(diagnostics): add REMOTE_DIAGNOSTICS_ROADMAP.md with 140+ tasks across 4 phases
Complete roadmap for remote debug tracing system with:
- Phase 1: Server foundation (types, repository, routes, 38+ tests)
- Phase 2: Client SDKs (TypeScript, Swift, Kotlin)
- Phase 3: Admin UI (Next.js dashboard)
- Phase 4: Advanced features (auto-triggers, profiling)

Review fixes included:
- Fixed partition keys to avoid hot partitions (composite pk)
- Added PII redaction patterns (email, SSN, CC, phone, IP, JWT)
- Added event bus integration with 8 events
- Fixed screenshot storage to use Azure Blob
- Added rate limiting specs for all endpoints
- Added ETag caching for config polling
parent 03ad80a615
commit 4163e1410a
651
docs/devops/REMOTE_DIAGNOSTICS_ROADMAP.md
Normal file
@@ -0,0 +1,651 @@
# Remote Diagnostics & Debug Tracing: Implementation Roadmap

> **Module:** `platform-service/src/modules/diagnostics/`
> **Client SDK:** `@bytelyst/diagnostics`
> **Target:** E2E debug collection from any device, on-demand triggers, industry-parity features
> **Estimated Effort:** 2–3 weeks
> **Status:** 🟡 Planning

---

## Executive Summary

This roadmap delivers a **Datadog/Sentry-grade remote diagnostics system** for the ByteLyst ecosystem. Unlike our existing passive telemetry, it enables **active debugging sessions** in which engineers remotely trigger deep diagnostics collection from any user device.

### Key Differentiators vs. Existing Telemetry

| Feature         | Existing Telemetry        | Remote Diagnostics                    |
| --------------- | ------------------------- | ------------------------------------- |
| Trigger         | Passive (always sampling) | **Active** (engineer-initiated)       |
| Log Level       | Static config             | **Dynamic** (debug/trace per session) |
| Network Tracing | None                      | **Full HTTP capture**                 |
| Breadcrumbs     | Basic events              | **Rich timeline** (user journey)      |
| Console Logs    | Error-only                | **Full capture** (debug→fatal)        |
| Screenshots     | None                      | **Auto-capture on crash**             |
| Session Replay  | None                      | **Future: video-style replay**        |

---

## Phase 1: Server Foundation (Week 1)

### 1.1 Data Model & Schemas

- [ ] **1.1.1** Create `modules/diagnostics/types.ts`
  - [ ] `DebugSessionDoc`: session metadata (status, target, config)
  - [ ] `DebugTraceDoc`: trace spans with timing
  - [ ] `DebugLogEntryDoc`: structured log entries
  - [ ] `DiagnosticsConfigDoc`: per-product collection policies
  - [ ] Zod schemas for all inputs
- [ ] **1.1.2** Add Cosmos containers to `cosmos-init.ts`
  - [ ] `debug_sessions` (pk: `/id`, TTL: 7 days)
  - [ ] `debug_traces` (pk: `/pk` with composite `${productId}:${sessionId}`, TTL: 7 days)
  - [ ] `debug_logs` (pk: `/pk` with composite `${productId}:${sessionId}`, TTL: 3 days)
  - [ ] `debug_screenshots` metadata (pk: `/sessionId`); actual images are stored in Azure Blob
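
The composite partition key and the container TTLs listed above could be captured in a small helper. This is a minimal sketch: the function and constant names are hypothetical, and only the `${productId}:${sessionId}` format and the 7-day/3-day TTLs come from the roadmap.

```typescript
// Hypothetical helper names; only the key format and TTL values come from section 1.1.2.
const SECONDS_PER_DAY = 24 * 60 * 60;

// Cosmos DB expresses TTL in seconds.
const CONTAINER_TTL_SECONDS = {
  debug_sessions: 7 * SECONDS_PER_DAY,
  debug_traces: 7 * SECONDS_PER_DAY,
  debug_logs: 3 * SECONDS_PER_DAY,
} as const;

// Builds the `/pk` value shared by trace and log documents.
function buildCompositePk(productId: string, sessionId: string): string {
  if (!productId || !sessionId) {
    throw new Error("productId and sessionId are required");
  }
  // Assumption: product IDs never contain ':', so the key can be split again.
  if (productId.includes(":")) {
    throw new Error("productId must not contain ':'");
  }
  return `${productId}:${sessionId}`;
}

// Splits a composite key at the first ':' back into its two parts.
function parseCompositePk(pk: string): { productId: string; sessionId: string } {
  const idx = pk.indexOf(":");
  if (idx <= 0 || idx === pk.length - 1) {
    throw new Error(`malformed pk: ${pk}`);
  }
  return { productId: pk.slice(0, idx), sessionId: pk.slice(idx + 1) };
}
```

Keeping the key construction in one place means the ingest path and the query path cannot drift apart on the separator or field order.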

### 1.2 Repository Layer

- [ ] **1.2.1** Create `modules/diagnostics/repository.ts`
  - [ ] Use `@bytelyst/datastore` `getCollection()` pattern (see `telemetry/repository.ts`)
  - [ ] `createSession()`: initiate debug session, emit `diagnostics.session.created` event
  - [ ] `getSession()`: fetch session by ID (single-partition point read, since `/id` is the partition key)
  - [ ] `getSessionForIngest()`: optimized lookup for client ingest (query by `sessionId` field)
  - [ ] `updateSession()`: status changes, emit `diagnostics.session.updated` event
  - [ ] `listSessions()`: query by `productId` field with pagination (cross-partition, because `productId` is not the partition key)
  - [ ] `deleteSession()`: manual cleanup, emit `diagnostics.session.deleted` event
  - [ ] `ingestTrace()`: batch upsert traces (use `upsert()` for idempotency)
  - [ ] `ingestLogs()`: batch upsert logs with PII scan (reuse `telemetry` PII patterns)
  - [ ] `getTraces()`: query by composite pk `${productId}:${sessionId}`
  - [ ] `getLogs()`: query by composite pk with level filters
  - [ ] `updateSessionStats()`: denormalize logCount/traceCount atomically
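
To illustrate why `ingestTrace()` relies on upsert for idempotency, here is a small in-memory sketch. The `InMemoryTraceStore` class is a hypothetical stand-in for the `debug_traces` container, not the `@bytelyst/datastore` API; it only shows that replaying a batch leaves the stored data and the derived counts unchanged.

```typescript
// Hypothetical stand-in for the `debug_traces` container (not the real datastore API).
interface TraceItem {
  id: string; // tr_<uuid>
  pk: string; // `${productId}:${sessionId}`
  name: string;
}

class InMemoryTraceStore {
  private readonly items = new Map<string, TraceItem>();

  // Upserts every item and returns how many were new rather than replays.
  upsertBatch(batch: TraceItem[]): number {
    let inserted = 0;
    for (const item of batch) {
      const key = `${item.pk}|${item.id}`;
      if (!this.items.has(key)) inserted += 1;
      this.items.set(key, item); // same key overwrites, so retries are harmless
    }
    return inserted;
  }

  // Counts documents in one logical partition.
  countByPk(pk: string): number {
    let count = 0;
    for (const item of this.items.values()) {
      if (item.pk === pk) count += 1;
    }
    return count;
  }
}
```

If `updateSessionStats()` increments `traceCount` by the number of newly inserted items rather than by the batch size, a client retry after a timeout does not inflate the denormalized counters.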

### 1.3 REST API Routes

- [ ] **1.3.1** Create `modules/diagnostics/routes.ts`
  - [ ] Apply `requireRole('admin')` for all session management routes
  - [ ] Apply rate limiting: 10 session creates per admin per hour (prevent abuse)
  - [ ] `POST /diagnostics/sessions`: create session (admin only)
    - [ ] Validate target user exists (if userId provided)
    - [ ] Validate product exists and is active
    - [ ] Emit `diagnostics.session.created` to event bus
  - [ ] `GET /diagnostics/sessions`: list sessions (admin only)
    - [ ] Query params: productId, status, userId, from, to, limit, offset
    - [ ] Default sort: createdAt desc
  - [ ] `GET /diagnostics/sessions/:id`: get session details (admin or session owner)
  - [ ] `PATCH /diagnostics/sessions/:id`: update session (admin only)
    - [ ] Validate state transitions (pending→active, active→paused, etc.)
    - [ ] Emit `diagnostics.session.updated` event
  - [ ] `DELETE /diagnostics/sessions/:id`: cancel session (admin only)
    - [ ] Soft delete (mark cancelled, don't hard delete for audit trail)
    - [ ] Emit `diagnostics.session.cancelled` event
  - [ ] `GET /diagnostics/config`: client polling endpoint (any authenticated user)
    - [ ] Return active session for this device/user if one exists
    - [ ] ETag support for 304 caching (reduce bandwidth)
    - [ ] Rate limit: 1 request per 5 seconds per device
  - [ ] `POST /diagnostics/ingest`: batch trace/log ingestion (any authenticated user)
    - [ ] Validate session is active for this device
    - [ ] PII scan all log messages (reuse telemetry PII patterns)
    - [ ] Batch size limit: 50 items per request
    - [ ] Async processing for large batches (return 202 Accepted)
  - [ ] `POST /diagnostics/sessions/:id/screenshots`: register screenshot metadata and request an upload URL
    - [ ] Generate SAS token via existing `blob` module for direct Azure upload
    - [ ] Store metadata in `debug_screenshots` container
    - [ ] Return 201 with blob URL for client upload
  - [ ] `GET /diagnostics/sessions/:id/screenshots`: list screenshot metadata (admin)
  - [ ] `GET /diagnostics/sessions/:id/traces`: get traces with pagination
  - [ ] `GET /diagnostics/sessions/:id/logs`: get logs with level filter, search
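
The state-transition validation required by the `PATCH` route could be driven by a lookup table. The roadmap names only pending→active and active→paused explicitly, so the remaining entries below are assumptions inferred from the lifecycle `pending → active → paused → completed | cancelled`, and `canTransition` is a hypothetical name.

```typescript
type SessionStatus = "pending" | "active" | "paused" | "completed" | "cancelled";

// Assumed transition table; only pending->active and active->paused are stated in the roadmap.
const ALLOWED_TRANSITIONS: Record<SessionStatus, readonly SessionStatus[]> = {
  pending: ["active", "cancelled"],
  active: ["paused", "completed", "cancelled"],
  paused: ["active", "completed", "cancelled"],
  completed: [], // terminal state
  cancelled: [], // terminal state
};

// Returns true when a session may move from `from` to `to`.
function canTransition(from: SessionStatus, to: SessionStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}
```

A table like this keeps the rule set in one reviewable place, and the route handler can respond with a client error whenever `canTransition` returns `false`.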

### 1.4 Testing

- [ ] **1.4.1** Create `modules/diagnostics/diagnostics.test.ts`
  - [ ] Session CRUD tests (10 tests)
    - [ ] Create session with valid target user
    - [ ] Create session fails for non-existent user
    - [ ] Create session rate limiting (10/hour)
    - [ ] Get session by ID
    - [ ] List sessions with filters
    - [ ] Update session status transitions
    - [ ] Cancel session (soft delete)
    - [ ] Session not found after TTL expires
    - [ ] Unauthorized access blocked
    - [ ] Event bus emissions verified
  - [ ] Trace ingestion tests (8 tests)
    - [ ] Batch trace ingest success
    - [ ] Trace ingest with invalid session rejected
    - [ ] Duplicate trace idempotency (upsert)
    - [ ] Composite pk query by session
    - [ ] Trace timing validation
    - [ ] Parent-child span relationships
    - [ ] Trace with error status
    - [ ] Large batch rejected (>50 items)
  - [ ] Log ingestion tests (8 tests)
    - [ ] Batch log ingest success
    - [ ] Log with PII redacted (email, SSN)
    - [ ] Log level filtering
    - [ ] Invalid session rejected
    - [ ] Log search by message content
    - [ ] Log context preservation
    - [ ] Fatal log triggers alert
    - [ ] Log TTL enforcement (3 days)
  - [ ] Config polling tests (6 tests)
    - [ ] Returns active session for device
    - [ ] Returns empty when no active session
    - [ ] ETag 304 caching works
    - [ ] Rate limit enforced (5 sec)
    - [ ] Wrong device cannot access other session
    - [ ] Expired session not returned
  - [ ] Screenshot tests (6 tests)
    - [ ] SAS token generation via blob module
    - [ ] Metadata stored in Cosmos
    - [ ] Direct Azure Blob upload works
    - [ ] Screenshot metadata retrieval
    - [ ] Unauthorized access blocked
    - [ ] Blob lifecycle tied to session TTL
  - [ ] **Target:** 38+ Vitest tests (increased from 28)

### 1.5 Integration

- [ ] **1.5.1** Wire into `server.ts`
  - [ ] Import `diagnosticsRoutes` from `./modules/diagnostics/routes.js`
  - [ ] Register: `await app.register(diagnosticsRoutes, { prefix: '/api' })`
  - [ ] Add after telemetry routes (logical grouping)
- [ ] **1.5.2** Event Bus Integration (`lib/event-bus.ts`)
  - [ ] Subscribe to `diagnostics.session.created` → Send notification to target user (email/push)
  - [ ] Subscribe to `diagnostics.session.cancelled` → Notify admin who started session
  - [ ] Subscribe to `diagnostics.ingest.fatal` → Alert on-call engineer (PagerDuty/Slack)
  - [ ] Subscribe to `diagnostics.session.completed` → Email summary to admin
- [ ] **1.5.3** Audit Logging (`modules/audit/`)
  - [ ] Log all session lifecycle events (create, update, cancel)
  - [ ] Include target user ID, admin ID, session config in audit trail
  - [ ] Retention: 90 days via `audit_log` container TTL
- [ ] **1.5.4** Rate Limiting Registration
  - [ ] Add `diagnostics:session:create` rate limit key (10/hour per admin)
  - [ ] Add `diagnostics:config:poll` rate limit key (1/5sec per device)
  - [ ] Add `diagnostics:ingest:submit` rate limit key (100/min per device)
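
The three rate-limit keys above can be modelled with a sliding-window limiter. This is only a sketch under stated assumptions: the `SlidingWindowLimiter` class is hypothetical, state lives in process memory (a real deployment would likely need a shared store), and the clock is passed in so the behaviour is easy to test.

```typescript
interface RateLimitRule {
  limit: number; // max requests per window
  windowMs: number; // window length in milliseconds
}

// Limits taken from section 1.5.4; the object shape is an assumption.
const RATE_LIMIT_RULES = {
  "diagnostics:session:create": { limit: 10, windowMs: 60 * 60 * 1000 },
  "diagnostics:config:poll": { limit: 1, windowMs: 5 * 1000 },
  "diagnostics:ingest:submit": { limit: 100, windowMs: 60 * 1000 },
} as const;

type RateLimitKey = keyof typeof RATE_LIMIT_RULES;

class SlidingWindowLimiter {
  // Timestamps of accepted requests, grouped by "<rule>:<subject>".
  private readonly hits = new Map<string, number[]>();

  // `subject` is the admin ID or device ID; `nowMs` is injected for testability.
  allow(key: RateLimitKey, subject: string, nowMs: number): boolean {
    const rule: RateLimitRule = RATE_LIMIT_RULES[key];
    const bucket = `${key}:${subject}`;
    const recent = (this.hits.get(bucket) ?? []).filter(
      (t) => nowMs - t < rule.windowMs,
    );
    if (recent.length >= rule.limit) {
      this.hits.set(bucket, recent);
      return false; // caller would typically answer 429
    }
    recent.push(nowMs);
    this.hits.set(bucket, recent);
    return true;
  }
}
```

Because rejected requests are not recorded, a device that polls too eagerly is not penalised beyond the original 5-second window.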

**Phase 1 Exit Criteria:**

- [ ] All routes return the expected status codes (200, 201, 202, 304) with correct payloads
- [ ] 38+ tests passing (updated from 28)
- [ ] Event bus subscribers registered and tested
- [ ] Audit logs written for all session operations
- [ ] Rate limiting enforced
- [ ] PII redaction working in log ingest
- [ ] Admin can create session via API

---

## Phase 2: Client SDK Abstractions (Week 1–2)

### 2.1 TypeScript Client SDK

- [ ] **2.1.1** Create `@bytelyst/diagnostics` package
  - [ ] `package.json` with ESM exports
  - [ ] `tsconfig.json` extending base
- [ ] **2.1.2** Core types (`src/types.ts`)
  - [ ] `DiagnosticsSession` interface
  - [ ] `TraceSpan` interface
  - [ ] `LogLevel` enum (debug, info, warn, error, fatal)
  - [ ] `DiagnosticsConfig` from server
- [ ] **2.1.3** Main client (`src/client.ts`)
  - [ ] `DiagnosticsClient` class (singleton)
  - [ ] `start()`: begin polling for active sessions
  - [ ] `stop()`: cease polling
  - [ ] `isSessionActive()`: check current state
  - [ ] `trace(name, fn)`: auto-instrumented span wrapper
  - [ ] `log(level, message, context)`: structured logging
  - [ ] `breadcrumb(message, category)`: add timeline marker
- [ ] **2.1.4** Network interceptor (`src/network.ts`)
  - [ ] `NetworkInterceptor` class
  - [ ] Wrap `fetch()` for capture
  - [ ] Capture: URL, method, headers (sanitized), timing, status
  - [ ] Configurable URL patterns (include/exclude)
- [ ] **2.1.5** Breadcrumb trail (`src/breadcrumbs.ts`)
  - [ ] Ring buffer (max 100 entries)
  - [ ] Auto-capture: navigation, clicks, errors
  - [ ] Manual: `breadcrumb()` API
- [ ] **2.1.6** Device state (`src/device.ts`)
  - [ ] Memory, battery, storage, thermal, network type
- [ ] **2.1.7** Screenshot capture (`src/screenshot.ts`)
  - [ ] `html2canvas` integration (browser)
  - [ ] Auto-capture on error (configurable)
- [ ] **2.1.8** Tests (`src/__tests__/diagnostics.test.ts`)
  - [ ] Client lifecycle tests (4)
  - [ ] Trace recording tests (4)
  - [ ] Log buffering tests (4)
  - [ ] Network interception tests (4)
  - [ ] Breadcrumb tests (4)
  - [ ] **Target:** 20+ Vitest tests
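
The breadcrumb ring buffer from 2.1.5 might look like the following. It is a minimal sketch: `BreadcrumbBuffer` and its method names are hypothetical, and only the 100-entry cap and the `(message, category)` argument order come from the roadmap.

```typescript
interface Breadcrumb {
  message: string;
  category: string;
  timestamp: string; // ISO 8601
}

class BreadcrumbBuffer {
  private readonly slots: Array<Breadcrumb | undefined>;
  private readonly capacity: number;
  private head = 0; // index of the next write
  private size = 0; // number of valid entries

  constructor(capacity: number = 100) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
    this.slots = new Array<Breadcrumb | undefined>(capacity).fill(undefined);
  }

  // Appends a breadcrumb, overwriting the oldest entry once the buffer is full.
  add(message: string, category: string, now: Date = new Date()): void {
    this.slots[this.head] = { message, category, timestamp: now.toISOString() };
    this.head = (this.head + 1) % this.capacity;
    if (this.size < this.capacity) this.size += 1;
  }

  // Returns the retained breadcrumbs, oldest first.
  snapshot(): Breadcrumb[] {
    const start = (this.head - this.size + this.capacity) % this.capacity;
    const out: Breadcrumb[] = [];
    for (let i = 0; i < this.size; i++) {
      const entry = this.slots[(start + i) % this.capacity];
      if (entry) out.push(entry);
    }
    return out;
  }
}
```

A fixed-size buffer bounds memory on the device no matter how long the app runs, while still keeping the most recent user journey available when a session starts or a crash occurs.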

### 2.2 Swift Client SDK (iOS)

- [ ] **2.2.1** Create `ByteLystDiagnostics` Swift package
  - [ ] `Package.swift` with iOS 15+ target
  - [ ] Module structure: Core, Network, UI, Device
- [ ] **2.2.2** Core client (`Sources/Core/DiagnosticsClient.swift`)
  - [ ] `DiagnosticsClient` actor (thread-safe)
  - [ ] `start()`: polling with `Timer`
  - [ ] `trace<T>(name, operation)`: async span wrapper
  - [ ] `log(level, message, metadata)`: os_log integration
  - [ ] `breadcrumb(category, message)`: timeline
- [ ] **2.2.3** Network interception (`Sources/Network/URLSessionInterceptor.swift`)
  - [ ] `URLProtocol` subclass for automatic capture
  - [ ] Swizzle `URLSession` or use protocol registration
  - [ ] Capture: request/response, timing, bytes
- [ ] **2.2.4** Device state (`Sources/Device/DeviceState.swift`)
  - [ ] `UIDevice` integration (battery level and state)
  - [ ] `ProcessInfo` (thermal state, physical memory)
  - [ ] `NetworkMonitor` (path status)
- [ ] **2.2.5** Screenshot (`Sources/UI/ScreenshotCapture.swift`)
  - [ ] `UIApplication` key window capture
  - [ ] Privacy: blur sensitive views (configurable)
- [ ] **2.2.6** Tests (`Tests/DiagnosticsClientTests.swift`)
  - [ ] Unit tests (12)
  - [ ] Integration tests (8)

### 2.3 Kotlin Client SDK (Android)

- [ ] **2.3.1** Create `diagnostics` module in `kotlin-platform-sdk`
  - [ ] `build.gradle.kts` dependencies
  - [ ] Coroutines + OkHttp + WorkManager
- [ ] **2.3.2** Core client (`diagnostics/DiagnosticsClient.kt`)
  - [ ] Singleton with `StateFlow<DiagnosticsState>`
  - [ ] `start()`: foreground service for polling
  - [ ] `trace()`: suspend function with span
  - [ ] `log()`: Logcat + structured queue
- [ ] **2.3.3** Network interceptor (`diagnostics/OkHttpInterceptor.kt`)
  - [ ] `Interceptor` implementation
  - [ ] Capture: request/response chain
- [ ] **2.3.4** Device state (`diagnostics/DeviceStateCollector.kt`)
  - [ ] `BatteryManager`, `ActivityManager`, `StorageStatsManager`
- [ ] **2.3.5** Screenshot (`diagnostics/ScreenshotCapture.kt`)
  - [ ] `MediaProjection` API (with permission)
  - [ ] `PixelCopy` for surface capture
- [ ] **2.3.6** Tests (`diagnostics/DiagnosticsClientTest.kt`)
  - [ ] Unit tests (10)
  - [ ] Integration tests (6)

**Phase 2 Exit Criteria:**

- [ ] TS SDK builds + 20 tests passing
- [ ] Swift SDK builds + 20 tests passing
- [ ] Kotlin SDK builds + 16 tests passing
- [ ] All SDKs can poll config endpoint

---

## Phase 3: Admin Dashboard UI (Week 2)

### 3.1 Debug Sessions Page

- [ ] **3.1.1** Create `/ops/debug-sessions/page.tsx`
  - [ ] Session list table (columns: ID, user, device, status, started, duration)
  - [ ] Filters: status, product, date range
  - [ ] Pagination
  - [ ] "New Session" button → modal
- [ ] **3.1.2** New Session Modal
  - [ ] Target user (email/userId search)
  - [ ] Target device (dropdown from sessions)
  - [ ] Collection level (standard, debug, trace)
  - [ ] Duration slider (5 min → 24 hr)
  - [ ] Screenshot on error (toggle)
  - [ ] "Start Session" → POST API

### 3.2 Session Detail View

- [ ] **3.2.1** Create `/ops/debug-sessions/[id]/page.tsx`
  - [ ] Session header (status badge, user info, device info)
  - [ ] Action buttons: Extend (+30 min), Stop, Download
  - [ ] Tabs: Timeline, Logs, Network, Traces, Screenshots
- [ ] **3.2.2** Timeline Tab
  - [ ] Breadcrumb list (time, category, message)
  - [ ] Visual timeline (horizontal bar)
  - [ ] Click to jump to related trace/log
- [ ] **3.2.3** Logs Tab
  - [ ] Log level filters (debug, info, warn, error, fatal)
  - [ ] Search/filter
  - [ ] Expandable rows with context
  - [ ] Syntax highlighting
- [ ] **3.2.4** Network Tab
  - [ ] Request list (time, method, URL, status, duration)
  - [ ] Click to view: request/response headers, body
  - [ ] Filter by status code, URL pattern
- [ ] **3.2.5** Traces Tab
  - [ ] Trace tree view (spans, parent-child)
  - [ ] Timing visualization (waterfall)
  - [ ] Search by name
- [ ] **3.2.6** Screenshots Tab
  - [ ] Grid of thumbnails
  - [ ] Click to expand
  - [ ] Download all as ZIP

### 3.3 Real-time Updates

- [ ] **3.3.1** Server-sent events or polling
  - [ ] Auto-refresh session status every 5 seconds
  - [ ] Toast notification on new logs/traces

### 3.4 Client Library

- [ ] **3.4.1** Create `lib/diagnostics-client.ts`
  - [ ] `querySessions()`
  - [ ] `createSession()`
  - [ ] `getSession()`
  - [ ] `updateSession()`
  - [ ] `getTraces()`
  - [ ] `getLogs()`

**Phase 3 Exit Criteria:**

- [ ] Admin can create session from UI
- [ ] Session detail shows live data
- [ ] All 5 tabs functional (Timeline, Logs, Network, Traces, Screenshots)

---

## Phase 4: Advanced Features (Week 3)

### 4.1 Automated Triggers

- [ ] **4.1.1** Error-threshold triggers
  - [ ] Config: "Start debug session if error rate > X%"
  - [ ] Background job checks every 5 minutes
  - [ ] Auto-notify on Slack/Teams
- [ ] **4.1.2** Crash-triggered sessions
  - [ ] Client sends crash → server auto-starts session
  - [ ] Captures 60 seconds pre-crash context
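
The error-threshold check in 4.1.1 reduces to a small predicate that the 5-minute background job could call. The names below are hypothetical, and the `minEvents` guard is an added assumption (not in the roadmap) to avoid triggering on a handful of events.

```typescript
interface ErrorWindow {
  totalEvents: number; // all events seen in the evaluation window
  errorEvents: number; // events classified as errors
}

// Error rate as a percentage; an empty window counts as 0%.
function errorRatePercent(w: ErrorWindow): number {
  if (w.totalEvents <= 0) return 0;
  return (w.errorEvents / w.totalEvents) * 100;
}

// True when the rate strictly exceeds the configured threshold ("error rate > X%").
// `minEvents` is an assumed safeguard against noisy, low-volume windows.
function shouldAutoStartSession(
  w: ErrorWindow,
  thresholdPercent: number,
  minEvents: number = 20,
): boolean {
  if (w.totalEvents < minEvents) return false;
  return errorRatePercent(w) > thresholdPercent;
}
```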

### 4.2 Session Replay (Future)

- [ ] **4.2.1** DOM/View state capture
  - [ ] Record user interactions (clicks, scrolls, inputs)
  - [ ] Replay as video-like timeline
  - [ ] Privacy: exclude password fields

### 4.3 Performance Profiling

- [ ] **4.3.1** CPU/Memory profiling
  - [ ] iOS: `os_signpost` integration
  - [ ] Android: `Debug.MemoryInfo`
  - [ ] Web: `performance.now()` + memory API

### 4.4 Integration Tests

- [ ] **4.4.1** E2E test: Admin creates session → Client captures → Admin views
  - [ ] Playwright test (web client)
  - [ ] XCTest UI test (iOS)
  - [ ] Espresso test (Android)

**Phase 4 Exit Criteria:**

- [ ] Auto-trigger tests passing
- [ ] E2E flow working end-to-end

---

## Appendix A: Data Models

### DebugSessionDoc

```typescript
interface DebugSessionDoc {
  id: string; // ds_<uuid>; also the partition key (/id)
  productId: string; // For filtering/querying (not pk to avoid hot partitions)

  // Target (at least one required)
  targetUserId?: string; // For authenticated users
  targetAnonymousId?: string; // For anonymous users (installId)
  targetDeviceId?: string; // Specific device fingerprint
  targetSessionId?: string; // Specific app session to capture

  // Status lifecycle: pending → active → paused → completed | cancelled
  status: 'pending' | 'active' | 'paused' | 'completed' | 'cancelled';

  // Collection configuration
  collectionLevel: 'standard' | 'debug' | 'trace';
  captureLogs: boolean;
  captureNetwork: boolean;
  captureScreenshots: boolean;
  screenshotOnError: boolean;
  maxDurationMinutes: number; // Default: 60, Max: 1440 (24h)

  // Timestamps
  createdAt: string; // ISO 8601
  updatedAt: string; // Last status/config change
  startedAt?: string; // When status became 'active'
  endedAt?: string; // When status became 'completed'|'cancelled'
  expiresAt: string; // Auto-expiry (createdAt + maxDurationMinutes)

  // Stats (denormalized for fast reads, updated via ingest)
  logCount: number;
  traceCount: number;
  screenshotCount: number;

  // Audit
  createdBy: string; // Admin userId who created session
  updatedBy?: string; // Last admin to modify

  // Consent tracking (privacy compliance)
  userConsent?: {
    consentedAt: string;
    consentMethod: 'prompt' | 'pre_consent' | 'auto'; // How user agreed
  };
}
```
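
The `expiresAt` field is described as `createdAt + maxDurationMinutes`. A minimal sketch of that arithmetic follows; the helper names are hypothetical, while the 60-minute default and the 1440-minute cap come from the comments in the interface above.

```typescript
const DEFAULT_DURATION_MINUTES = 60;
const MAX_DURATION_MINUTES = 1440; // 24h

// Computes the ISO 8601 expiry timestamp for a session.
function computeExpiresAt(
  createdAt: string,
  maxDurationMinutes: number = DEFAULT_DURATION_MINUTES,
): string {
  const createdMs = Date.parse(createdAt);
  if (Number.isNaN(createdMs)) {
    throw new Error(`invalid createdAt: ${createdAt}`);
  }
  if (
    !Number.isInteger(maxDurationMinutes) ||
    maxDurationMinutes < 1 ||
    maxDurationMinutes > MAX_DURATION_MINUTES
  ) {
    throw new RangeError(`maxDurationMinutes must be 1..${MAX_DURATION_MINUTES}`);
  }
  return new Date(createdMs + maxDurationMinutes * 60_000).toISOString();
}

// True once `now` has reached or passed the expiry time.
function isExpired(expiresAt: string, now: Date): boolean {
  return now.getTime() >= Date.parse(expiresAt);
}
```

The config polling endpoint could call `isExpired` before returning a session, which matches the "Expired session not returned" test case in Phase 1.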

### DebugTraceDoc (OpenTelemetry-compatible)

```typescript
interface DebugTraceDoc {
  id: string; // tr_<uuid>
  pk: string; // Composite: `${productId}:${sessionId}` (partition key)
  sessionId: string; // For queries (also part of pk)
  productId: string; // For filtering

  // OpenTelemetry trace context
  traceId: string; // OTel trace ID (hex)
  parentId?: string; // Parent span ID (omitted for root spans)
  spanId: string; // This span's ID
  name: string; // Operation name (e.g., "UserLogin", "API.fetchUser")
  kind?: 'internal' | 'server' | 'client' | 'producer' | 'consumer';

  // Timing (ISO 8601 strings; include fractional seconds where the client clock provides them)
  startTime: string; // ISO 8601
  endTime?: string;
  durationMs?: number;

  // Context and attributes
  attributes: Record<string, unknown>; // Custom key-value pairs
  status: 'ok' | 'error' | 'unset';
  statusMessage?: string; // Error description if status=error

  // Span events (timestamped points within the span, e.g., "db.query", "cache.hit")
  events?: Array<{
    name: string;
    timestamp: string;
    attributes?: Record<string, unknown>;
  }>;

  // Links to other traces (for async operations)
  links?: Array<{
    traceId: string;
    spanId: string;
    attributes?: Record<string, unknown>;
  }>;
}
```
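
A span wrapper in the spirit of the SDK's `trace(name, fn)` could fill the timing and status fields of this document. The sketch below is an assumption-laden illustration: `traceSpan`, `SpanRecord`, and the `sink` callback are hypothetical names, and ID generation plus batching are left out.

```typescript
// Subset of DebugTraceDoc that the wrapper can fill in by itself.
interface SpanRecord {
  name: string;
  startTime: string; // ISO 8601
  endTime: string; // ISO 8601
  durationMs: number;
  status: "ok" | "error";
  statusMessage?: string;
}

// Runs `fn`, records one span via `sink`, and passes the result or error through unchanged.
async function traceSpan<T>(
  name: string,
  fn: () => Promise<T> | T,
  sink: (span: SpanRecord) => void,
): Promise<T> {
  const startMs = Date.now();
  const finish = (status: "ok" | "error", statusMessage?: string): SpanRecord => {
    const endMs = Date.now();
    return {
      name,
      startTime: new Date(startMs).toISOString(),
      endTime: new Date(endMs).toISOString(),
      durationMs: endMs - startMs,
      status,
      statusMessage,
    };
  };
  try {
    const result = await fn();
    sink(finish("ok"));
    return result;
  } catch (err) {
    sink(finish("error", err instanceof Error ? err.message : String(err)));
    throw err; // instrumentation must not swallow application errors
  }
}
```

Usage would look like `await traceSpan("API.fetchUser", () => loadUser(id), enqueue)`, where `loadUser` and `enqueue` stand in for application code and the SDK's upload queue.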

### DebugLogEntryDoc

```typescript
interface DebugLogEntryDoc {
  id: string; // log_<uuid>
  pk: string; // Composite: `${productId}:${sessionId}` (partition key)
  sessionId: string; // For queries (also part of pk)
  productId: string; // For filtering

  // Log level (matches syslog/OTel severity)
  level: 'debug' | 'info' | 'warn' | 'error' | 'fatal';
  message: string; // Original message (PII redacted server-side)
  messageHash?: string; // SHA-256 for deduplication

  // Timestamp (client clock, server enriches with receivedAt)
  timestamp: string; // ISO 8601; when the log was generated
  receivedAt?: string; // Server-side ingestion time

  // Source context
  module: string; // Component/module name (e.g., "AudioEngine", "SyncManager")
  file?: string; // Source file path (sanitized)
  line?: number; // Line number
  function?: string; // Function/method name

  // Thread/task context
  threadId?: string; // For multi-threaded apps
  correlationId?: string; // Links related operations

  // Arbitrary context (PII scanned and redacted)
  context: Record<string, unknown>;

  // PII redaction metadata
  redaction?: {
    fieldsRedacted: string[]; // Which fields were scrubbed
    patternsMatched: string[]; // Which PII patterns found (email, ssn, etc.)
  };
}
```

### DebugScreenshotDoc (Metadata only; image in Blob)

```typescript
interface DebugScreenshotDoc {
  id: string; // scr_<uuid>
  sessionId: string; // Partition key for queries
  productId: string;

  // Storage reference (actual image in Azure Blob)
  blobUrl: string; // SAS URL to blob (time-limited)
  blobPath: string; // Path in container: `screenshots/${productId}/${sessionId}/${id}.png`
  containerName: string; // Azure Blob container (e.g., "diagnostics-screenshots")

  // Screenshot metadata
  capturedAt: string; // When captured
  trigger: 'manual' | 'error' | 'interval' | 'user_request'; // Why taken

  // Dimensions & format
  width: number;
  height: number;
  format: 'png' | 'jpeg' | 'webp';
  sizeBytes: number;

  // Privacy
  sensitiveViewsBlurred: boolean; // Whether PII areas were blurred
  blurRegions?: Array<{ x: number; y: number; w: number; h: number }>; // If partial blur

  // Optional context
  screenName?: string; // Current screen/view when captured
  breadcrumbAtCapture?: string; // Last breadcrumb before screenshot
}
```

---

## Appendix B: API Reference

| Method | Endpoint                                    | Auth        | Rate Limit     | Description                             |
| ------ | ------------------------------------------- | ----------- | -------------- | --------------------------------------- |
| POST   | `/api/diagnostics/sessions`                 | Admin       | 10/hour/admin  | Create debug session                    |
| GET    | `/api/diagnostics/sessions`                 | Admin       | 100/min        | List sessions (paginated)               |
| GET    | `/api/diagnostics/sessions/:id`             | Admin/Owner | 100/min        | Get session details                     |
| PATCH  | `/api/diagnostics/sessions/:id`             | Admin       | 10/min         | Update session status                   |
| DELETE | `/api/diagnostics/sessions/:id`             | Admin       | 10/min         | Cancel session (soft delete)            |
| GET    | `/api/diagnostics/config`                   | Any auth    | 1/5sec/device  | Poll for active session (ETag cached)   |
| POST   | `/api/diagnostics/ingest`                   | Any auth    | 100/min/device | Submit traces/logs batch (max 50 items) |
| POST   | `/api/diagnostics/sessions/:id/traces`      | Any auth    | 100/min/device | Ingest trace spans                      |
| POST   | `/api/diagnostics/sessions/:id/logs`        | Any auth    | 100/min/device | Ingest log entries                      |
| POST   | `/api/diagnostics/sessions/:id/screenshots` | Any auth    | 10/min/device  | Get SAS URL for screenshot upload       |
| GET    | `/api/diagnostics/sessions/:id/traces`      | Admin       | 100/min        | Query traces for session                |
| GET    | `/api/diagnostics/sessions/:id/logs`        | Admin       | 100/min        | Query logs with filters                 |
| GET    | `/api/diagnostics/sessions/:id/screenshots` | Admin       | 100/min        | List screenshot metadata                |
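
The ETag caching on `GET /api/diagnostics/config` can be sketched without any framework. Everything below is illustrative: `respondToPoll` is a hypothetical name, the hash is a simple FNV-1a-style function over UTF-16 code units (chosen only to keep the sketch dependency-free; a real service would more likely use a cryptographic digest), and only exact `If-None-Match` matches are handled.

```typescript
// Small non-cryptographic hash, adequate as a change detector in a sketch.
function fnv1aHex(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

type PollResponse =
  | { status: 200; etag: string; body: string }
  | { status: 304; etag: string };

// `config` is whatever the endpoint would return (e.g. the active session, or null).
function respondToPoll(config: unknown, ifNoneMatch?: string): PollResponse {
  const body = JSON.stringify(config);
  const etag = `"${fnv1aHex(body)}"`; // ETags are quoted strings
  if (ifNoneMatch !== undefined && ifNoneMatch === etag) {
    return { status: 304, etag }; // unchanged: no body is sent
  }
  return { status: 200, etag, body };
}
```

With a 5-second polling interval, most polls hit the 304 branch, so the per-device bandwidth is dominated by headers rather than by the session payload.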

---

## Appendix C: Industry Comparison

| Capability      | Firebase Crashlytics | Sentry   | Datadog RUM | Our Solution       |
| --------------- | -------------------- | -------- | ----------- | ------------------ |
| Crash Reporting | ✅                   | ✅       | ✅          | ✅ (via telemetry) |
| Error Tracking  | ✅                   | ✅       | ✅          | ✅ (via telemetry) |
| Breadcrumbs     | ✅                   | ✅       | ✅          | ✅                 |
| Custom Traces   | ⚠️ Limited           | ✅       | ✅          | ✅                 |
| Network Tracing | ❌                   | ✅       | ✅          | ✅                 |
| Console Logs    | ⚠️ Error only        | ✅       | ✅          | ✅ (all levels)    |
| Session Replay  | ❌                   | ✅       | ✅          | 🟡 Future          |
| Remote Trigger  | ❌                   | ✅       | ❌          | ✅                 |
| On-Device Debug | ❌                   | ❌       | ❌          | ✅                 |
| Screenshots     | ⚠️ Crash only        | ✅       | ❌          | ✅                 |
| Open Source     | ❌                   | ✅ (SDK) | ❌          | ✅                 |

---

## Appendix D: Privacy & Security

### D.1 PII Redaction Patterns (server-side)

| Pattern        | Regex                                                              | Redaction Method                  | Example                                          |
| -------------- | ------------------------------------------------------------------ | --------------------------------- | ------------------------------------------------ |
| Email          | `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`                   | Replace with `[EMAIL]`            | `user@example.com` → `[EMAIL]`                   |
| SSN (US)       | `\b\d{3}-?\d{2}-?\d{4}\b`                                          | Replace with `[SSN]`              | `123-45-6789` → `[SSN]`                          |
| Credit Card    | `\b(?:\d[ -]*?){13,16}\b`                                          | Replace with `[CC]`               | `4111 1111 1111 1111` → `[CC]`                   |
| Phone          | `\b\+?1?\s?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b`                  | Replace with `[PHONE]`            | `+1 (555) 123-4567` → `[PHONE]`                  |
| IP Address     | `\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b`                           | Replace with `[IP]`               | `192.168.1.1` → `[IP]`                           |
| Password/Token | `(password\|token\|secret\|key)\s*[:=]\s*\S+` (case-insensitive)   | Replace value with `[CREDENTIAL]` | `password: secret123` → `password: [CREDENTIAL]` |
| JWT            | `eyJ[a-zA-Z0-9_-]*\.eyJ[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]*`             | Replace with `[JWT]`              | Full JWT → `[JWT]`                               |

- [ ] **1. PII Redaction:** Implement all patterns above in `lib/pii-redaction.ts` (shared with telemetry)
- [ ] **2. Redaction Metadata:** Store the scrubbed field names in `redaction.fieldsRedacted` and the matched pattern names in `redaction.patternsMatched` for transparency
- [ ] **3. Consent Tracking:** `userConsent` field in session doc with `consentedAt` and `consentMethod`
- [ ] **4. Data Retention:** 7-day default TTL for sessions/traces, 3-day for logs (Cosmos TTL)
- [ ] **5. Access Control:** Admin-only session creation; users can only view their own sessions via `targetUserId` check
- [ ] **6. Encryption:** All data encrypted at rest (Cosmos), in transit (TLS 1.3)
- [ ] **7. Audit Trail:** All session operations logged to `audit_log` container (90-day retention)
- [ ] **8. User Notification:** Email/push notification when debug session started on their device
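
The patterns in D.1 could be applied in sequence as follows. This is a sketch, not the contents of `lib/pii-redaction.ts`: the rule order, the `redactPii` name, and the return shape are assumptions. JavaScript regular expressions do not accept a leading `(?i)`, so the credential rule uses the `i` flag, and it captures the key name so that only the value is replaced, as in the table's example.

```typescript
interface RedactionRule {
  name: string;
  regex: RegExp;
  replacement: string;
}

// Order is an assumption: longer, more specific tokens run before generic digit patterns.
const REDACTION_RULES: RedactionRule[] = [
  { name: "jwt", regex: /eyJ[a-zA-Z0-9_-]*\.eyJ[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]*/g, replacement: "[JWT]" },
  { name: "credential", regex: /((?:password|token|secret|key)\s*[:=]\s*)\S+/gi, replacement: "$1[CREDENTIAL]" },
  { name: "email", regex: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g, replacement: "[EMAIL]" },
  { name: "cc", regex: /\b(?:\d[ -]*?){13,16}\b/g, replacement: "[CC]" },
  { name: "ssn", regex: /\b\d{3}-?\d{2}-?\d{4}\b/g, replacement: "[SSN]" },
  { name: "phone", regex: /\b\+?1?\s?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b/g, replacement: "[PHONE]" },
  { name: "ip", regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/g, replacement: "[IP]" },
];

// Applies every rule in order and reports which patterns changed the text.
function redactPii(text: string): { text: string; patternsMatched: string[] } {
  const patternsMatched: string[] = [];
  let current = text;
  for (const rule of REDACTION_RULES) {
    const next = current.replace(rule.regex, rule.replacement);
    if (next !== current) patternsMatched.push(rule.name);
    current = next;
  }
  return { text: current, patternsMatched };
}
```

The `patternsMatched` array maps directly onto the `redaction.patternsMatched` field of `DebugLogEntryDoc`. Note that the SSN pattern as written also matches any bare nine-digit number, so some over-redaction should be expected.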

---

## Appendix E: Event Bus Events

| Event Name                        | Payload                                             | Publishers                        | Subscribers                         |
| --------------------------------- | --------------------------------------------------- | --------------------------------- | ----------------------------------- |
| `diagnostics.session.created`     | `{ sessionId, productId, targetUserId, createdBy }` | diagnostics module                | notifications → email/push user     |
| `diagnostics.session.started`     | `{ sessionId, productId, startedAt }`               | diagnostics module                | (none)                              |
| `diagnostics.session.updated`     | `{ sessionId, productId, changes, updatedBy }`      | diagnostics module                | audit log                           |
| `diagnostics.session.cancelled`   | `{ sessionId, productId, reason, cancelledBy }`     | diagnostics module                | notifications → admin               |
| `diagnostics.session.completed`   | `{ sessionId, productId, stats, endedAt }`          | diagnostics module                | notifications → admin summary email |
| `diagnostics.session.expired`     | `{ sessionId, productId, expiredAt }`               | diagnostics module (TTL job)      | (none)                              |
| `diagnostics.ingest.fatal`        | `{ sessionId, productId, logEntry, timestamp }`     | diagnostics module (on fatal log) | PagerDuty/Slack alert               |
| `diagnostics.screenshot.captured` | `{ sessionId, productId, screenshotId, trigger }`   | diagnostics module                | (none)                              |
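
The event names and payloads in the table lend themselves to a typed publish/subscribe map. The sketch below is not the API of `lib/event-bus.ts`: `TypedEventBus` is hypothetical, only three of the eight events are shown, and the field types are guesses based on the payload column.

```typescript
// Payload shapes inferred from the table; field types are assumptions.
interface DiagnosticsEvents {
  "diagnostics.session.created": { sessionId: string; productId: string; targetUserId?: string; createdBy: string };
  "diagnostics.session.cancelled": { sessionId: string; productId: string; reason: string; cancelledBy: string };
  "diagnostics.ingest.fatal": { sessionId: string; productId: string; logEntry: unknown; timestamp: string };
}

type EventName = keyof DiagnosticsEvents;
type Handler<K extends EventName> = (payload: DiagnosticsEvents[K]) => void;

class TypedEventBus {
  private readonly handlers = new Map<EventName, Array<(payload: unknown) => void>>();

  // Registers a subscriber; the payload type follows from the event name.
  on<K extends EventName>(event: K, handler: Handler<K>): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler as (payload: unknown) => void);
    this.handlers.set(event, list);
  }

  // Delivers the payload synchronously and returns the number of subscribers notified.
  emit<K extends EventName>(event: K, payload: DiagnosticsEvents[K]): number {
    const list = this.handlers.get(event) ?? [];
    for (const handler of list) handler(payload);
    return list.length;
  }
}
```

Typing the map this way lets the compiler reject a publisher that omits `createdBy` or a subscriber that reads a field the event does not carry.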

---

## Current Status

- [x] **Design complete:** 2026-03-02
- [x] **Review complete:** 10 bugs/gaps identified and fixed:
  1. Fixed partition keys to avoid hot partitions (composite pk for traces/logs)
  2. Added `pk` field to the trace and log data models, matching the existing telemetry pattern
  3. Added `updatedAt`/`updatedBy` for audit trail completeness
  4. Added `userConsent` field for GDPR/privacy compliance
  5. Fixed screenshot storage to use Azure Blob (not Cosmos)
  6. Added PII redaction patterns and metadata tracking
  7. Added event bus integration with 8 specific events
  8. Added rate limiting specs for all endpoints
  9. Added ETag caching for config polling
  10. Added `targetSessionId` for capturing specific app sessions
- [ ] Phase 1: Server Foundation (38 tests target)
- [ ] Phase 2: Client SDKs (TS/Swift/Kotlin)
- [ ] Phase 3: Admin UI
- [ ] Phase 4: Advanced Features

**Total Tasks:** 140+ checkboxes across 4 phases

**Last Updated:** 2026-03-02