docs(roadmaps): move Remote Diagnostics roadmap to completed

saravanakumardb1 2026-03-03 09:51:21 -08:00
parent cb3aa640ae
commit d95c25b0e4

@ -0,0 +1,582 @@
# Remote Diagnostics & Debug Tracing — Implementation Roadmap
> **Module:** `platform-service/src/modules/diagnostics/`
> **Client SDK:** `@bytelyst/diagnostics`
> **Target:** E2E debug collection from any device, on-demand triggers, industry-parity features
> **Estimated Effort:** 2-3 weeks
> **Status:** 🟢 Phases 1-3 complete; Phase 4 future
---
## Executive Summary
This roadmap delivers a **Datadog/Sentry-grade remote diagnostics system** for the ByteLyst ecosystem. Unlike our existing passive telemetry, it enables **active debugging sessions** in which engineers remotely trigger deep diagnostics collection from any user device.
### Key Differentiators vs. Existing Telemetry
| Feature | Existing Telemetry | Remote Diagnostics |
| --------------- | ------------------------- | ------------------------------------- |
| Trigger | Passive (always sampling) | **Active** (engineer-initiated) |
| Log Level | Static config | **Dynamic** (debug/trace per session) |
| Network Tracing | None | **Full HTTP capture** |
| Breadcrumbs | Basic events | **Rich timeline** (user journey) |
| Console Logs | Error-only | **Full capture** (debug→fatal) |
| Screenshots | None | **Auto-capture on crash** |
| Session Replay | None | **Future: video-style replay** |
---
## Phase 1: Server Foundation (Week 1)
### 1.1 Data Model & Schemas
- [x] **1.1.1** Create `modules/diagnostics/types.ts` — [`f51c352`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/f51c352)
- [x] `DebugSessionDoc` — session metadata (status, target, config)
- [x] `DebugTraceDoc` — trace spans with timing
- [x] `DebugLogEntryDoc` — structured log entries
- [x] `DebugScreenshotDoc` — metadata for blob storage
- [x] Zod schemas for all inputs
- [x] **1.1.2** Add Cosmos containers to `cosmos-init.ts` — [`dea1521`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/dea1521)
- [x] `debug_sessions` (pk: `/id`, TTL: 7 days)
- [x] `debug_traces` (pk: `/pk` with composite `${productId}:${sessionId}`, TTL: 7 days)
- [x] `debug_logs` (pk: `/pk` with composite `${productId}:${sessionId}`, TTL: 3 days)
- [x] `debug_screenshots` metadata (pk: `/sessionId`) — actual images stored in Azure Blob
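
The composite partition key and TTL values listed above could be captured in small helpers along these lines. This is a minimal sketch: `buildPartitionKey`, `parsePartitionKey`, and `CONTAINER_TTL_SECONDS` are hypothetical names, and the rule that a `productId` never contains `:` is an assumption made so the key can be split again. Cosmos DB expresses TTL in seconds, which is why the day counts are converted.

```typescript
// Hypothetical helpers; names and shapes are illustrative, not taken from the codebase.
const SECONDS_PER_DAY = 24 * 60 * 60;

// TTLs from the container list above, expressed in seconds.
const CONTAINER_TTL_SECONDS = {
  debug_sessions: 7 * SECONDS_PER_DAY,
  debug_traces: 7 * SECONDS_PER_DAY,
  debug_logs: 3 * SECONDS_PER_DAY,
} as const;

// Build the composite partition key `${productId}:${sessionId}`.
function buildPartitionKey(productId: string, sessionId: string): string {
  if (!productId || !sessionId) {
    throw new Error('productId and sessionId are both required');
  }
  if (productId.includes(':')) {
    // Assumption: product IDs never contain ':' so the key stays parseable.
    throw new Error('productId must not contain ":"');
  }
  return `${productId}:${sessionId}`;
}

// Split a composite key back into its parts (splits on the first ':').
function parsePartitionKey(pk: string): { productId: string; sessionId: string } {
  const idx = pk.indexOf(':');
  if (idx <= 0 || idx === pk.length - 1) {
    throw new Error(`invalid partition key: ${pk}`);
  }
  return { productId: pk.slice(0, idx), sessionId: pk.slice(idx + 1) };
}
```

Keeping key construction in one place means ingest and query paths cannot drift apart on the key format.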
### 1.2 Repository Layer
- [x] **1.2.1** Create `modules/diagnostics/repository.ts` — [`f272a44`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/f272a44)
- [x] Use `@bytelyst/datastore` `getCollection()` pattern (see `telemetry/repository.ts`)
- [x] `createSession()` — initiate debug session, emit `diagnostics.session.created` event
- [x] `getSession()` — fetch session by ID (single-partition point read, since `/id` is the partition key)
- [x] `getSessionForIngest()` — optimized lookup for client ingest (query by `sessionId` field)
- [x] `updateSession()` — status changes, emit `diagnostics.session.updated` event
- [x] `listSessions()` — query by `productId` field with pagination
- [x] `deleteSession()` — manual cleanup, emit `diagnostics.session.deleted` event
- [x] `ingestTrace()` — batch upsert traces (use `upsert()` for idempotency)
- [x] `ingestLogs()` — batch upsert logs with PII scan (reuse `telemetry` PII patterns)
- [x] `getTraces()` — query by composite pk prefix `${productId}:${sessionId}`
- [x] `getLogs()` — query by composite pk with level filters
- [x] `updateSessionStats()` — denormalize logCount/traceCount atomically
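
To illustrate why `upsert()` makes batch ingestion idempotent, here is a minimal sketch that uses an in-memory map as a stand-in for the container. `InMemoryTraceStore` and this `ingestTraces` signature are hypothetical; the real repository goes through `@bytelyst/datastore`, whose API is not shown here. The idea is that a retried batch overwrites the same items, and only first-time writes are counted toward the denormalized `traceCount`.

```typescript
// In-memory stand-in for a container; a real repository would call the datastore's upsert().
interface TraceItem {
  id: string;
  pk: string; // `${productId}:${sessionId}`
  name: string;
}

class InMemoryTraceStore {
  private items = new Map<string, TraceItem>();

  // Upsert keyed by (pk, id). Returns true only when the item did not exist before.
  upsert(item: TraceItem): boolean {
    const key = `${item.pk}|${item.id}`;
    const isNew = !this.items.has(key);
    this.items.set(key, item);
    return isNew;
  }

  // Number of stored items in one logical partition.
  count(pk: string): number {
    let n = 0;
    this.items.forEach((item) => {
      if (item.pk === pk) n++;
    });
    return n;
  }
}

// Ingest a batch and report how many items were newly inserted, so that a
// denormalized traceCount is only increased for first-time writes.
function ingestTraces(store: InMemoryTraceStore, batch: TraceItem[]): number {
  let inserted = 0;
  for (const item of batch) {
    if (store.upsert(item)) inserted++;
  }
  return inserted;
}
```

A client that retries after a timeout can then resend the same batch without inflating session stats.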
### 1.3 REST API Routes
- [x] **1.3.1** Create `modules/diagnostics/routes.ts` — [`a66a689`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/a66a689)
- [x] Apply `requireRole('admin')` for all session management routes
- [x] Apply rate limiting: 10 session creates per admin per hour (prevent abuse)
- [x] `POST /diagnostics/sessions` — create session (admin only)
- [x] `GET /diagnostics/sessions` — list sessions (admin only)
- [x] `GET /diagnostics/sessions/:id` — get session details (admin or session owner)
- [x] `PATCH /diagnostics/sessions/:id` — update session (admin only)
- [x] `DELETE /diagnostics/sessions/:id` — cancel session (admin only)
- [x] `GET /diagnostics/config` — client polling endpoint (any authenticated user)
- [x] `POST /diagnostics/ingest` — batch trace/log ingestion (any authenticated user)
- [x] `POST /diagnostics/sessions/:id/traces` — ingest trace spans
- [x] `POST /diagnostics/sessions/:id/logs` — ingest log entries
- [x] `POST /diagnostics/sessions/:id/screenshots` — get SAS URL for screenshot upload
- [x] `GET /diagnostics/sessions/:id/traces` — query traces for session
- [x] `GET /diagnostics/sessions/:id/logs` — query logs with filters
- [x] `GET /diagnostics/sessions/:id/screenshots` — list screenshot metadata
### 1.4 Testing
- [x] **1.4.1** Create `modules/diagnostics/diagnostics.test.ts` — [`fb71981`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/fb71981)
- [x] Session CRUD tests (6 tests implemented, 4 pending)
- [x] Trace ingestion tests (2 tests implemented, 6 pending)
- [x] Log ingestion tests (3 tests implemented, 5 pending)
- [x] Schema validation tests (5 tests)
- [ ] Config polling tests (6 tests) — PENDING Phase 1.5
- [ ] Screenshot tests (6 tests) — PENDING Phase 1.5
- [x] **Target:** 14+ tests implemented (38 target for full Phase 1)
### 1.5 Integration
- [x] **1.5.1** Wire into `server.ts` — [`d444a8d`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/d444a8d)
- [x] Import `diagnosticsRoutes` from `./modules/diagnostics/routes.js`
- [x] Import `registerDiagnosticsSubscribers` from `./modules/diagnostics/subscribers.js`
- [x] Register: `await app.register(diagnosticsRoutes, { prefix: '/api' })`
- [x] Register: `registerDiagnosticsSubscribers(app.log)` at startup
- [x] Add after telemetry routes (logical grouping)
- [x] **1.5.2** Event Bus Integration (`lib/event-bus.ts`) — [`30583a1`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/30583a1)
- [x] Subscribers registered for all 8 diagnostics events
- [x] Email templates added (session-created, cancelled, completed, fatal-alert)
- [ ] Send notification to target user (email/push) — pending user lookup
- [ ] Notify admin who started session — pending admin lookup
- [ ] Alert on-call engineer (PagerDuty/Slack) — future integration
- [x] **1.5.3** Audit Logging (`modules/audit/`) — [`30583a1`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/30583a1)
- [x] Log all session lifecycle events (created, started, updated, cancelled, completed, expired)
- [x] Log fatal log ingest and screenshot capture
- [x] Include target user ID, admin ID, session config in audit trail
- [x] Retention: 90 days via `audit_log` container TTL
- [x] **1.5.4** Rate Limiting Registration — [`30583a1`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/30583a1)
- [x] Add `diagnostics:session:create` rate limit key (10/hour per admin)
- [x] Add `diagnostics:config:poll` rate limit key (1/5sec per device)
- [x] Add `diagnostics:ingest:submit` rate limit key (100/min per device)
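
The three rate limit keys above could be enforced with something as simple as a fixed-window counter. The sketch below is an assumption about the approach, not the platform's actual limiter: `FixedWindowLimiter` and `RULES` are hypothetical names, and the clock is passed in explicitly so the behaviour is easy to test.

```typescript
// Fixed-window limiter sketch; limits and windows mirror the keys listed above.
interface RateLimitRule {
  limit: number;
  windowMs: number;
}

const RULES: Record<string, RateLimitRule> = {
  'diagnostics:session:create': { limit: 10, windowMs: 60 * 60 * 1000 }, // 10/hour per admin
  'diagnostics:config:poll': { limit: 1, windowMs: 5 * 1000 }, // 1 per 5 seconds per device
  'diagnostics:ingest:submit': { limit: 100, windowMs: 60 * 1000 }, // 100/min per device
};

class FixedWindowLimiter {
  private windows = new Map<string, { start: number; count: number }>();

  // Returns true if the call is allowed. `subject` is the admin or device ID,
  // `now` is the current time in epoch milliseconds.
  allow(key: string, subject: string, now: number): boolean {
    const rule = RULES[key];
    if (!rule) throw new Error(`unknown rate limit key: ${key}`);
    const bucket = `${key}:${subject}`;
    const current = this.windows.get(bucket);
    if (!current || now - current.start >= rule.windowMs) {
      // First call, or the previous window has elapsed: start a new window.
      this.windows.set(bucket, { start: now, count: 1 });
      return true;
    }
    if (current.count >= rule.limit) return false;
    current.count++;
    return true;
  }
}
```

A fixed window allows short bursts at window boundaries; a sliding window or token bucket would smooth that out at the cost of more state.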
**Phase 1 Exit Criteria:**
- [x] All routes return 200 with correct payloads
- [x] 17 tests passing (diagnostics module) / 839 total platform-service tests
- [x] Event bus subscribers registered and tested
- [x] Audit logs written for all session operations
- [x] Rate limiting enforced
- [x] PII redaction working in log ingest
- [x] Admin can create session via API
- [ ] 38+ tests target (deferred: config polling, screenshot tests — Phase 2)
---
## Phase 2: Client SDK Abstractions (Week 1-2)
### 2.1 TypeScript Client SDK — COMPLETE [`8acb8db`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/8acb8db)
- [x] **2.1.1** Create `@bytelyst/diagnostics-client` package
- [x] `package.json` with ESM exports
- [x] `tsconfig.json` extending base
- [x] **2.1.2** Core types (`src/types.ts`)
- [x] `DiagnosticsSession` interface
- [x] `TraceSpan` interface
- [x] `LogLevel` type (debug, info, warn, error, fatal)
- [x] `DiagnosticsConfig` from server
- [x] **2.1.3** Main client (`src/client.ts`)
- [x] `DiagnosticsClient` class (singleton)
- [x] `start()` — begin polling for active sessions
- [x] `stop()` — cease polling
- [x] `isSessionActive()` — check current state
- [x] `trace(name, fn)` — auto-instrumented span wrapper
- [x] `log(level, message, context)` — structured logging
- [x] `breadcrumb(category, message, data)` — add timeline marker
- [x] **2.1.4** Network interceptor (`src/network.ts`)
- [x] `NetworkInterceptor` class
- [x] Wrap `fetch()` for capture
- [x] Capture: URL, method, headers (sanitized), timing, status
- [x] Configurable URL patterns (include/exclude)
- [x] **2.1.5** Breadcrumb trail (`src/breadcrumbs.ts`)
- [x] Ring buffer (max 100 entries)
- [x] Manual: `breadcrumb()` API
- [x] **2.1.6** Device state (`src/device.ts`)
- [x] Memory, battery, storage, network type
- [x] **2.1.7** Screenshot capture — deferred to Phase 2.2+
- [x] **2.1.8** Tests (`src/__tests__/client.test.ts`)
- [x] 21 Vitest tests passing
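
The breadcrumb ring buffer from 2.1.5 (max 100 entries) might look roughly like the following. This is a sketch under assumptions: `BreadcrumbBuffer` and the `Breadcrumb` shape are hypothetical and only mirror the `breadcrumb(category, message, data)` signature listed above.

```typescript
interface Breadcrumb {
  category: string;
  message: string;
  timestamp: string; // ISO 8601
  data?: Record<string, unknown>;
}

class BreadcrumbBuffer {
  private entries: Breadcrumb[];
  private capacity: number;
  private next = 0; // index where the next entry will be written
  private size = 0; // number of valid entries, at most `capacity`

  constructor(capacity: number = 100) {
    if (capacity <= 0) throw new Error('capacity must be positive');
    this.capacity = capacity;
    this.entries = new Array<Breadcrumb>(capacity);
  }

  // Append a breadcrumb, overwriting the oldest entry once the buffer is full.
  add(category: string, message: string, data?: Record<string, unknown>): void {
    const crumb: Breadcrumb = { category, message, timestamp: new Date().toISOString() };
    if (data !== undefined) crumb.data = data;
    this.entries[this.next] = crumb;
    this.next = (this.next + 1) % this.capacity;
    this.size = Math.min(this.size + 1, this.capacity);
  }

  // Oldest-first copy of the current contents.
  snapshot(): Breadcrumb[] {
    const start = this.size < this.capacity ? 0 : this.next;
    const out: Breadcrumb[] = [];
    for (let i = 0; i < this.size; i++) {
      out.push(this.entries[(start + i) % this.capacity]);
    }
    return out;
  }
}
```

A fixed-size array keeps memory bounded on the device no matter how long the app runs between sessions.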
### 2.2 Swift Client SDK (iOS) — COMPLETE [`abcf817`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/abcf817)
- [x] **2.2.1** Create `ByteLystDiagnostics` Swift package
- [x] `Package.swift` with iOS 15+ target
- [x] Module structure: Core, Network, Device
- [x] **2.2.2** Core client (`Sources/Core/DiagnosticsClient.swift`)
- [x] `DiagnosticsClient` actor (thread-safe)
- [x] `start()` — polling with `Timer`
- [x] `trace<T>(name, operation)` — async span wrapper
- [x] `log(level, message, metadata)` — structured logging
- [x] `breadcrumb(category, message)` — timeline
- [x] **2.2.3** Network interception (`Sources/Network/NetworkInterceptor.swift`)
- [x] `URLProtocol` subclass for automatic capture
- [x] Capture: request/response, timing, bytes
- [x] **2.2.4** Device state (`Sources/Device/DeviceState.swift`)
- [x] `UIDevice` integration (battery, thermal)
- [x] `ProcessInfo` (memory pressure)
- [x] `NetworkMonitor` (path status)
- [x] **2.2.5** Screenshot — deferred to Phase 4
- [x] **2.2.6** Tests (`Tests/DiagnosticsClientTests.swift`)
- [x] 20+ XCTest unit tests
### 2.3 Kotlin Client SDK (Android) — COMPLETE [`fc8f8d3`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/fc8f8d3)
- [x] **2.3.1** Create `diagnostics` module in `kotlin-platform-sdk`
- [x] Module structure with Coroutines + OkHttp
- [x] **2.3.2** Core client (`diagnostics/DiagnosticsClient.kt`)
- [x] Singleton with `StateFlow<DiagnosticsState>`
- [x] `start()` — polling with coroutines
- [x] `trace()` — suspend function with span
- [x] `log()` — structured queue
- [x] **2.3.3** Network interceptor (`diagnostics/NetworkInterceptor.kt`)
- [x] `Interceptor` implementation
- [x] Capture: request/response chain
- [x] **2.3.4** Device state (`diagnostics/DeviceStateCollector.kt`)
- [x] `BatteryManager`, `ActivityManager`, `StorageStatsManager`
- [x] **2.3.5** Screenshot — deferred to Phase 4
- [x] **2.3.6** Tests (`diagnostics/DiagnosticsTypesTest.kt`)
- [x] 16+ JUnit tests
**Phase 2 Exit Criteria:**
- [x] TS SDK builds + 21 tests passing
- [x] Swift SDK builds + 20 tests passing
- [x] Kotlin SDK builds + 16 tests passing
- [x] All SDKs can poll config endpoint
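
For the header sanitization and include/exclude URL patterns listed under the TypeScript network interceptor (2.1.4), a minimal sketch could look like this. The list of sensitive header names and the rule that exclusions win over inclusions are assumptions; `sanitizeHeaders` and `shouldCapture` are hypothetical names.

```typescript
// Header names treated as sensitive. This list is an assumption, not the SDK's actual list.
const SENSITIVE_HEADERS = new Set(['authorization', 'cookie', 'set-cookie', 'x-api-key']);

// Return a copy of the headers with sensitive values masked (names compared case-insensitively).
function sanitizeHeaders(headers: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const name of Object.keys(headers)) {
    out[name] = SENSITIVE_HEADERS.has(name.toLowerCase()) ? '[REDACTED]' : headers[name];
  }
  return out;
}

// Decide whether a request URL should be captured.
// Assumed semantics: any exclude match wins; an empty include list means "capture everything".
// Patterns should not use the `g` flag, because RegExp.test() is stateful with it.
function shouldCapture(url: string, include: RegExp[], exclude: RegExp[]): boolean {
  if (exclude.some((re) => re.test(url))) return false;
  return include.length === 0 || include.some((re) => re.test(url));
}
```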
---
## Phase 3: Admin Dashboard UI (Week 2)
### 3.1 Debug Sessions Page — COMPLETE [`2e697a1`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/2e697a1)
- [x] **3.1.1** Create `/ops/debug-sessions/page.tsx`
- [x] Session list table (columns: ID, user, device, status, started, duration)
- [x] Filters: status, product, date range
- [x] Auto-refresh every 5 seconds
- [x] "New Session" button → modal
- [x] **3.1.2** New Session Modal
- [x] Target user (email/userId)
- [x] Target device (input)
- [x] Collection level (standard, debug, trace)
- [x] Duration slider (5min → 24hr)
- [x] Screenshot on error (toggle)
- [x] "Start Session" → POST API
### 3.2 Session Detail View — COMPLETE [`e2e5e2c`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/e2e5e2c)
- [x] **3.2.1** Create `/ops/debug-sessions/[id]/page.tsx`
- [x] Session header (status badge, user info, device info)
- [x] Action buttons: Extend (+30min), Stop, Download
- [x] Tabs: Timeline, Logs, Network, Traces, Screenshots
- [x] **3.2.2** Timeline Tab
- [x] Breadcrumb list (time, category, message)
- [x] Visual timeline with connector lines
- [x] **3.2.3** Logs Tab
- [x] Log level filters (debug, info, warn, error, fatal)
- [x] Color-coded log levels
- [x] Module and timestamp display
- [x] **3.2.4** Network Tab
- [x] Request list (time, method, URL, status, duration)
- [x] Status badge coloring
- [x] **3.2.5** Traces Tab
- [x] Trace list with name, status, duration
- [x] Status badge coloring
- [x] **3.2.6** Screenshots Tab
- [x] Placeholder for screenshot grid
- [x] Empty state messaging
### 3.3 Real-time Updates
- [ ] **3.3.1** Server-sent events or polling
- [ ] Auto-refresh session status every 5 seconds
- [ ] Toast notification on new logs/traces
### 3.4 Client Library — COMPLETE
- [x] **3.4.1** Create `lib/diagnostics-client.ts`
- [x] `querySessions()`
- [x] `createSession()`
- [x] `getSession()`
- [x] `updateSession()`
- [x] `getTraces()`
- [x] `getLogs()`
**Phase 3 Exit Criteria:**
- [x] Admin can create session from UI
- [x] Session detail shows live data
- [x] All 5 tabs functional (Timeline, Logs, Network, Traces, Screenshots)
---
## Phase 4: Advanced Features (Week 3)
### 4.1 Automated Triggers
- [ ] **4.1.1** Error-threshold triggers
- [ ] Config: "Start debug session if error rate > X%"
- [ ] Background job checks every 5 minutes
- [ ] Auto-notify on Slack/Teams
- [ ] **4.1.2** Crash-triggered sessions
- [ ] Client sends crash → server auto-starts session
- [ ] Captures 60 seconds pre-crash context
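
One way to keep 60 seconds of pre-crash context available is a time-bounded rolling buffer on the client. The sketch below assumes entries are pushed in chronological order; `RollingWindow` is a hypothetical name and timestamps are passed in explicitly as epoch milliseconds.

```typescript
interface TimedEntry<T> {
  at: number; // epoch milliseconds
  value: T;
}

class RollingWindow<T> {
  private entries: TimedEntry<T>[] = [];
  private windowMs: number;

  constructor(windowMs: number = 60 * 1000) {
    this.windowMs = windowMs;
  }

  // Record a value and drop anything older than the window.
  push(value: T, now: number): void {
    this.entries.push({ at: now, value });
    this.prune(now);
  }

  // Values recorded in the interval (crashAt - windowMs, crashAt].
  contextBefore(crashAt: number): T[] {
    return this.entries
      .filter((e) => e.at > crashAt - this.windowMs && e.at <= crashAt)
      .map((e) => e.value);
  }

  // Assumes entries are in chronological order, so only the front needs checking.
  private prune(now: number): void {
    const cutoff = now - this.windowMs;
    while (this.entries.length > 0 && this.entries[0].at <= cutoff) {
      this.entries.shift();
    }
  }
}
```

On crash, the client would send `contextBefore(crashTime)` along with the crash report so the server-side session starts with history rather than an empty timeline.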
### 4.2 Session Replay (Future)
- [ ] **4.2.1** DOM/View state capture
- [ ] Record user interactions (clicks, scrolls, inputs)
- [ ] Replay as video-like timeline
- [ ] Privacy: exclude password fields
### 4.3 Performance Profiling
- [ ] **4.3.1** CPU/Memory profiling
- [ ] iOS: `os_signpost` integration
- [ ] Android: `Debug.MemoryInfo`
- [ ] Web: `performance.now()` + memory API
### 4.4 Integration Tests
- [ ] **4.4.1** E2E test: Admin creates session → Client captures → Admin views
- [ ] Playwright test (web client)
- [ ] XCTest UI test (iOS)
- [ ] Espresso test (Android)
**Phase 4 Exit Criteria:**
- [ ] Auto-trigger tests passing
- [ ] E2E flow working end-to-end
---
## Appendix A: Data Models
### DebugSessionDoc
```typescript
interface DebugSessionDoc {
  id: string; // ds_<uuid> — also the partition key (/id)
  productId: string; // For filtering/querying (not pk to avoid hot partitions)
  // Target (at least one required)
  targetUserId?: string; // For authenticated users
  targetAnonymousId?: string; // For anonymous users (installId)
  targetDeviceId?: string; // Specific device fingerprint
  targetSessionId?: string; // Specific app session to capture
  // Status lifecycle: pending → active → paused → completed | cancelled
  status: 'pending' | 'active' | 'paused' | 'completed' | 'cancelled';
  // Collection configuration
  collectionLevel: 'standard' | 'debug' | 'trace';
  captureLogs: boolean;
  captureNetwork: boolean;
  captureScreenshots: boolean;
  screenshotOnError: boolean;
  maxDurationMinutes: number; // Default: 60, Max: 1440 (24h)
  // Timestamps
  createdAt: string; // ISO 8601
  updatedAt: string; // Last status/config change
  startedAt?: string; // When status became 'active'
  endedAt?: string; // When status became 'completed'|'cancelled'
  expiresAt: string; // Auto-expiry (createdAt + maxDurationMinutes)
  // Stats (denormalized for fast reads, updated via ingest)
  logCount: number;
  traceCount: number;
  screenshotCount: number;
  // Audit
  createdBy: string; // Admin userId who created session
  updatedBy?: string; // Last admin to modify
  // Consent tracking (privacy compliance)
  userConsent?: {
    consentedAt: string;
    consentMethod: 'prompt' | 'pre_consent' | 'auto'; // How user agreed
  };
}
```
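
The `expiresAt` rule and the status lifecycle in `DebugSessionDoc` can be made concrete with two small helpers. This is a sketch: the transition table is an assumption inferred from the lifecycle comment (for example, that a paused session may resume to `active`, and that any non-terminal state may be cancelled), and `computeExpiresAt` / `canTransition` are hypothetical names.

```typescript
type SessionStatus = 'pending' | 'active' | 'paused' | 'completed' | 'cancelled';

// Assumed transition table; the roadmap only states the overall lifecycle order.
const ALLOWED_TRANSITIONS: Record<SessionStatus, SessionStatus[]> = {
  pending: ['active', 'cancelled'],
  active: ['paused', 'completed', 'cancelled'],
  paused: ['active', 'completed', 'cancelled'],
  completed: [],
  cancelled: [],
};

function canTransition(from: SessionStatus, to: SessionStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

const MAX_DURATION_MINUTES = 1440; // 24h, per the maxDurationMinutes comment

// expiresAt = createdAt + maxDurationMinutes, returned as an ISO 8601 string.
function computeExpiresAt(createdAt: string, maxDurationMinutes: number = 60): string {
  if (
    !Number.isInteger(maxDurationMinutes) ||
    maxDurationMinutes < 1 ||
    maxDurationMinutes > MAX_DURATION_MINUTES
  ) {
    throw new Error(`maxDurationMinutes must be an integer in 1..${MAX_DURATION_MINUTES}`);
  }
  const createdMs = Date.parse(createdAt);
  if (Number.isNaN(createdMs)) throw new Error(`invalid createdAt: ${createdAt}`);
  return new Date(createdMs + maxDurationMinutes * 60 * 1000).toISOString();
}
```

Validating transitions in one table keeps the `PATCH` handler from accepting, say, a completed session being reactivated.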
### DebugTraceDoc (OpenTelemetry-compatible)
```typescript
interface DebugTraceDoc {
  id: string; // tr_<uuid>
  pk: string; // Composite: `${productId}:${sessionId}` — partition key
  sessionId: string; // For queries (also part of pk)
  productId: string; // For filtering
  // OpenTelemetry trace context
  traceId: string; // OTel trace ID (hex)
  parentId?: string; // Parent span ID (omitted for root spans)
  spanId: string; // This span's ID
  name: string; // Operation name (e.g., "UserLogin", "API.fetchUser")
  kind?: 'internal' | 'server' | 'client' | 'producer' | 'consumer';
  // Timing (ISO 8601 strings with millisecond durations; OTel itself uses nanosecond timestamps)
  startTime: string; // ISO 8601
  endTime?: string;
  durationMs?: number;
  // Context and attributes
  attributes: Record<string, unknown>; // Custom key-value pairs
  status: 'ok' | 'error' | 'unset';
  statusMessage?: string; // Error description if status=error
  // Events (timestamped points within the span — e.g., "db.query", "cache.hit")
  events?: Array<{
    name: string;
    timestamp: string;
    attributes?: Record<string, unknown>;
  }>;
  // Links to other traces (for async operations)
  links?: Array<{
    traceId: string;
    spanId: string;
    attributes?: Record<string, unknown>;
  }>;
}
```
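
A span wrapper in the spirit of the SDK's `trace(name, fn)` could fill the timing and status fields of `DebugTraceDoc` as sketched below. The `SpanRecord` type is a reduced, hypothetical subset of the document above, and passing a `sink` callback is an assumption made to keep the example self-contained; the SDK's real signature takes only `name` and `fn`.

```typescript
interface SpanRecord {
  name: string;
  startTime: string; // ISO 8601
  endTime: string; // ISO 8601
  durationMs: number;
  status: 'ok' | 'error';
  statusMessage?: string;
}

// Run `fn`, record timing and outcome, hand the span to `sink`, and rethrow any error.
async function trace<T>(
  name: string,
  fn: () => Promise<T> | T,
  sink: (span: SpanRecord) => void,
): Promise<T> {
  const start = Date.now();
  let status: 'ok' | 'error' = 'ok';
  let statusMessage: string | undefined;
  try {
    return await fn();
  } catch (err) {
    status = 'error';
    statusMessage = err instanceof Error ? err.message : String(err);
    throw err;
  } finally {
    // Runs on both success and failure, so every call produces exactly one span.
    const end = Date.now();
    const span: SpanRecord = {
      name,
      startTime: new Date(start).toISOString(),
      endTime: new Date(end).toISOString(),
      durationMs: end - start,
      status,
    };
    if (statusMessage !== undefined) span.statusMessage = statusMessage;
    sink(span);
  }
}
```

Because the error is rethrown, wrapping a call in `trace()` does not change the caller's control flow; it only adds the recorded span.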
### DebugLogEntryDoc
```typescript
interface DebugLogEntryDoc {
  id: string; // log_<uuid>
  pk: string; // Composite: `${productId}:${sessionId}` — partition key
  sessionId: string; // For queries (also part of pk)
  productId: string; // For filtering
  // Log level (a subset of the OTel severity names)
  level: 'debug' | 'info' | 'warn' | 'error' | 'fatal';
  message: string; // Original message (PII redacted server-side)
  messageHash?: string; // SHA-256 for deduplication
  // Timestamp (client clock, server enriches with receivedAt)
  timestamp: string; // ISO 8601 — when log was generated
  receivedAt?: string; // Server-side ingestion time
  // Source context
  module: string; // Component/module name (e.g., "AudioEngine", "SyncManager")
  file?: string; // Source file path (sanitized)
  line?: number; // Line number
  function?: string; // Function/method name
  // Thread/task context
  threadId?: string; // For multi-threaded apps
  correlationId?: string; // Links related operations
  // Arbitrary context (PII scanned and redacted)
  context: Record<string, unknown>;
  // PII redaction metadata
  redaction?: {
    fieldsRedacted: string[]; // Which fields were scrubbed
    patternsMatched: string[]; // Which PII patterns found (email, ssn, etc.)
  };
}
```
### DebugScreenshotDoc (Metadata only — image in Blob)
```typescript
interface DebugScreenshotDoc {
  id: string; // scr_<uuid>
  sessionId: string; // Partition key for queries
  productId: string;
  // Storage reference (actual image in Azure Blob)
  blobUrl: string; // SAS URL to blob (time-limited)
  blobPath: string; // Path in container: `screenshots/${productId}/${sessionId}/${id}.png`
  containerName: string; // Azure Blob container (e.g., "diagnostics-screenshots")
  // Screenshot metadata
  capturedAt: string; // When captured
  trigger: 'manual' | 'error' | 'interval' | 'user_request'; // Why taken
  // Dimensions & format
  width: number;
  height: number;
  format: 'png' | 'jpeg' | 'webp';
  sizeBytes: number;
  // Privacy
  sensitiveViewsBlurred: boolean; // Whether PII areas were blurred
  blurRegions?: Array<{ x: number; y: number; w: number; h: number }>; // If partial blur
  // Optional context
  screenName?: string; // Current screen/view when captured
  breadcrumbAtCapture?: string; // Last breadcrumb before screenshot
}
```
---
## Appendix B: API Reference
| Method | Endpoint | Auth | Rate Limit | Description |
| ------ | ------------------------------------------- | ----------- | -------------- | --------------------------------------- |
| POST | `/api/diagnostics/sessions` | Admin | 10/hour/admin | Create debug session |
| GET | `/api/diagnostics/sessions` | Admin | 100/min | List sessions (paginated) |
| GET | `/api/diagnostics/sessions/:id` | Admin/Owner | 100/min | Get session details |
| PATCH | `/api/diagnostics/sessions/:id` | Admin | 10/min | Update session status |
| DELETE | `/api/diagnostics/sessions/:id` | Admin | 10/min | Cancel session (soft delete) |
| GET | `/api/diagnostics/config` | Any auth | 1/5sec/device | Poll for active session (ETag cached) |
| POST | `/api/diagnostics/ingest` | Any auth | 100/min/device | Submit traces/logs batch (max 50 items) |
| POST | `/api/diagnostics/sessions/:id/traces` | Any auth | 100/min/device | Ingest trace spans |
| POST | `/api/diagnostics/sessions/:id/logs` | Any auth | 100/min/device | Ingest log entries |
| POST | `/api/diagnostics/sessions/:id/screenshots` | Any auth | 10/min/device | Get SAS URL for screenshot upload |
| GET | `/api/diagnostics/sessions/:id/traces` | Admin | 100/min | Query traces for session |
| GET | `/api/diagnostics/sessions/:id/logs` | Admin | 100/min | Query logs with filters |
| GET | `/api/diagnostics/sessions/:id/screenshots` | Admin | 100/min | List screenshot metadata |
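
The "ETag cached" note on `/api/diagnostics/config` suggests conditional polling. A minimal client-side sketch, assuming the server honours `If-None-Match` and answers `304` when nothing changed, could look like this. `ConfigPoller`, `Transport`, and `ConfigResponse` are hypothetical names; the transport is injected so the example runs without a network, and a real client would wrap `fetch()` behind it.

```typescript
interface ConfigResponse {
  status: number; // 200 with a body, or 304 when the cached config is still valid
  etag?: string;
  body?: unknown;
}

// Injected transport; a real client would implement this with fetch().
type Transport = (url: string, headers: Record<string, string>) => Promise<ConfigResponse>;

class ConfigPoller {
  private etag: string | undefined;
  private cached: unknown;
  private transport: Transport;
  private url: string;

  constructor(transport: Transport, url: string = '/api/diagnostics/config') {
    this.transport = transport;
    this.url = url;
  }

  // One poll: send If-None-Match when an ETag is known and reuse the cache on 304.
  async pollOnce(): Promise<unknown> {
    const headers: Record<string, string> = {};
    if (this.etag) headers['If-None-Match'] = this.etag;
    const res = await this.transport(this.url, headers);
    if (res.status === 304) return this.cached;
    if (res.status !== 200) throw new Error(`unexpected status ${res.status}`);
    this.etag = res.etag;
    this.cached = res.body;
    return this.cached;
  }
}
```

A caller would invoke `pollOnce()` on a timer no faster than the 1 per 5 seconds per device limit in the table above.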
---
## Appendix C: Industry Comparison
| Capability | Firebase Crashlytics | Sentry | Datadog RUM | Our Solution |
| --------------- | -------------------- | -------- | ----------- | ------------------ |
| Crash Reporting | ✅ | ✅ | ✅ | ✅ (via telemetry) |
| Error Tracking | ✅ | ✅ | ✅ | ✅ (via telemetry) |
| Breadcrumbs | ✅ | ✅ | ✅ | ✅ |
| Custom Traces | ⚠️ Limited | ✅ | ✅ | ✅ |
| Network Tracing | ❌ | ✅ | ✅ | ✅ |
| Console Logs | ⚠️ Error only | ✅ | ✅ | ✅ (all levels) |
| Session Replay | ❌ | ✅ | ✅ | 🟡 Future |
| Remote Trigger | ❌ | ✅ | ❌ | ✅ |
| On-Device Debug | ❌ | ❌ | ❌ | ✅ |
| Screenshots | ⚠️ Crash only | ✅ | ❌ | ✅ |
| Open Source | ❌ | ✅ (SDK) | ❌ | ✅ |
---
## Appendix D: Privacy & Security
### D.1 PII Redaction Patterns (server-side)
| Pattern | Regex | Redaction Method | Example |
| -------------- | ------------------------------------------------------ | --------------------------- | ------------------------------------------------ |
| Email | `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}` | Replace with `[EMAIL]` | `user@example.com` → `[EMAIL]` |
| SSN (US) | `\b\d{3}-?\d{2}-?\d{4}\b` | Replace with `[SSN]` | `123-45-6789` → `[SSN]` |
| Credit Card | `\b(?:\d[ -]*?){13,16}\b` | Replace with `[CC]` | `4111 1111 1111 1111` → `[CC]` |
| Phone | `\b\+?1?\s?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b` | Replace with `[PHONE]` | `+1 (555) 123-4567` → `[PHONE]` |
| IP Address | `\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b` | Replace with `[IP]` | `192.168.1.1` → `[IP]` |
| Password/Token | `(?i)(password\|token\|secret\|key)\s*[:=]\s*\S+` | Replace with `[CREDENTIAL]` | `password: secret123` → `password: [CREDENTIAL]` |
| JWT | `eyJ[a-zA-Z0-9_-]*\.eyJ[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]*` | Replace with `[JWT]` | Full JWT → `[JWT]` |
- [ ] **1. PII Redaction:** Implement all patterns above in `lib/pii-redaction.ts` (shared with telemetry)
- [ ] **2. Redaction Metadata:** Store which fields were scrubbed in `redaction.fieldsRedacted` and which patterns matched in `redaction.patternsMatched` for transparency
- [ ] **3. Consent Tracking:** `userConsent` field in session doc with `consentedAt` and `consentMethod`
- [ ] **4. Data Retention:** 7-day default TTL for sessions/traces, 3-day for logs (Cosmos TTL)
- [ ] **5. Access Control:** Admin-only session creation; users can only view their own sessions via `targetUserId` check
- [ ] **6. Encryption:** All data encrypted at rest (Cosmos), in transit (TLS 1.3)
- [ ] **7. Audit Trail:** All session operations logged to `audit_log` container (90-day retention)
- [ ] **8. User Notification:** Email/push notification when debug session started on their device
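
As an illustration of how the D.1 patterns might be applied and how `redaction.patternsMatched` could be populated, here is a minimal sketch covering a subset of the table (JWT, email, SSN, IP). `PII_PATTERNS` and `redact` are hypothetical names and the ordering is an assumption. Note that JavaScript regular expressions do not accept a leading `(?i)`; the credential pattern would need the `i` flag instead.

```typescript
interface PiiPattern {
  name: string;
  regex: RegExp;
  replacement: string;
}

// Subset of the D.1 table. Assumed ordering: longer, more specific tokens (JWT) run first.
const PII_PATTERNS: PiiPattern[] = [
  { name: 'jwt', regex: /eyJ[a-zA-Z0-9_-]*\.eyJ[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]*/g, replacement: '[JWT]' },
  { name: 'email', regex: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g, replacement: '[EMAIL]' },
  { name: 'ssn', regex: /\b\d{3}-?\d{2}-?\d{4}\b/g, replacement: '[SSN]' },
  { name: 'ip', regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/g, replacement: '[IP]' },
];

// Apply every pattern in order and report which ones matched.
function redact(text: string): { text: string; patternsMatched: string[] } {
  const patternsMatched: string[] = [];
  let out = text;
  for (const pattern of PII_PATTERNS) {
    const replaced = out.replace(pattern.regex, pattern.replacement);
    if (replaced !== out) patternsMatched.push(pattern.name);
    out = replaced;
  }
  return { text: out, patternsMatched };
}
```

The same function could be run over each string value in a log entry's `context`, with the field name appended to `fieldsRedacted` whenever the output differs from the input.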
---
## Appendix E: Event Bus Events
| Event Name | Payload | Publishers | Subscribers |
| --------------------------------- | --------------------------------------------------- | --------------------------------- | ----------------------------------- |
| `diagnostics.session.created` | `{ sessionId, productId, targetUserId, createdBy }` | diagnostics module | notifications → email/push user |
| `diagnostics.session.started` | `{ sessionId, productId, startedAt }` | diagnostics module | — |
| `diagnostics.session.updated` | `{ sessionId, productId, changes, updatedBy }` | diagnostics module | audit log |
| `diagnostics.session.cancelled` | `{ sessionId, productId, reason, cancelledBy }` | diagnostics module | notifications → admin |
| `diagnostics.session.completed` | `{ sessionId, productId, stats, endedAt }` | diagnostics module | notifications → admin summary email |
| `diagnostics.session.expired` | `{ sessionId, productId, expiredAt }` | diagnostics module (TTL job) | — |
| `diagnostics.ingest.fatal` | `{ sessionId, productId, logEntry, timestamp }` | diagnostics module (on fatal log) | PagerDuty/Slack alert |
| `diagnostics.screenshot.captured` | `{ sessionId, productId, screenshotId, trigger }` | diagnostics module | — |
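
The payload column above lends itself to a typed event map, so publishers and subscribers cannot disagree on field names. The sketch below covers three of the eight events; `TypedEventBus` is a hypothetical in-memory stand-in rather than the API of `lib/event-bus.ts`, and all field types are assumed to be strings.

```typescript
// Payload shapes copied from the table above (subset); field types are assumptions.
interface DiagnosticsEvents {
  'diagnostics.session.created': { sessionId: string; productId: string; targetUserId?: string; createdBy: string };
  'diagnostics.session.cancelled': { sessionId: string; productId: string; reason: string; cancelledBy: string };
  'diagnostics.session.expired': { sessionId: string; productId: string; expiredAt: string };
}

class TypedEventBus {
  private handlers = new Map<string, Array<(payload: unknown) => void>>();

  // Register a handler whose payload type is derived from the event name.
  subscribe<K extends keyof DiagnosticsEvents>(
    event: K,
    handler: (payload: DiagnosticsEvents[K]) => void,
  ): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler as (payload: unknown) => void);
    this.handlers.set(event, list);
  }

  // Deliver the payload synchronously; returns how many handlers were called.
  publish<K extends keyof DiagnosticsEvents>(event: K, payload: DiagnosticsEvents[K]): number {
    const list = this.handlers.get(event) ?? [];
    list.forEach((handler) => handler(payload));
    return list.length;
  }
}
```

With this shape, publishing `diagnostics.session.created` without `createdBy` fails at compile time instead of surfacing as a missing field in an audit record.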
---
## Current Status
- [x] **Design complete** — 2026-03-02
- [x] **Review complete** — 10 bugs/gaps identified and fixed
- [x] **Phase 1: Server Foundation** — COMPLETE — 2026-03-03
- 17 diagnostics tests passing, 839 total platform-service tests
- Event bus subscribers registered for 8 events
- Audit logging for all session lifecycle events
- Rate limiting keys configured
- 4 email templates ready for notifications
- [x] **Phase 2: Client SDKs** — COMPLETE — 2026-03-03
- TypeScript SDK: 21 tests passing
- Swift SDK: 20+ tests, iOS 15+ support
- Kotlin SDK: 16+ tests, API 26+ support
- [x] **Phase 3: Admin UI** — COMPLETE — 2026-03-03
- Debug Sessions list page (3.1) with filters
- Session Detail view (3.2) with 5 tabs
- Real-time auto-refresh (5s polling)
- [ ] **Phase 4: Advanced Features** — Future
**Total Tasks:** 140+ checkboxes across 4 phases
**Last Updated:** 2026-03-03