learning_ai_common_plat/docs/devops/REMOTE_DIAGNOSTICS_ROADMAP.md
Remote Diagnostics & Debug Tracing — Implementation Roadmap

Module: platform-service/src/modules/diagnostics/
Client SDK: @bytelyst/diagnostics
Target: E2E debug collection from any device, on-demand triggers, industry-parity features
Estimated Effort: 2-3 weeks
Status: 🟡 In Progress (Phases 1-3 complete, Phase 4 pending)


Executive Summary

This roadmap delivers a Datadog/Sentry-grade remote diagnostics system for the ByteLyst ecosystem. Unlike passive telemetry (which we have), this enables active debugging sessions where engineers can remotely trigger deep diagnostics collection from any user device.

Key Differentiators vs. Existing Telemetry

| Feature | Existing Telemetry | Remote Diagnostics |
| --- | --- | --- |
| Trigger | Passive (always sampling) | Active (engineer-initiated) |
| Log Level | Static config | Dynamic (debug/trace per session) |
| Network Tracing | None | Full HTTP capture |
| Breadcrumbs | Basic events | Rich timeline (user journey) |
| Console Logs | Error-only | Full capture (debug→fatal) |
| Screenshots | None | Auto-capture on crash |
| Session Replay | None | Future: video-style replay |

Phase 1: Server Foundation (Week 1)

1.1 Data Model & Schemas

  • 1.1.1 Create modules/diagnostics/types.ts (f51c352)
    • DebugSessionDoc — session metadata (status, target, config)
    • DebugTraceDoc — trace spans with timing
    • DebugLogEntryDoc — structured log entries
    • DebugScreenshotDoc — metadata for blob storage
    • Zod schemas for all inputs
  • 1.1.2 Add Cosmos containers to cosmos-init.ts (dea1521)
    • debug_sessions (pk: /id, TTL: 7 days)
    • debug_traces (pk: /pk with composite ${productId}:${sessionId}, TTL: 7 days)
    • debug_logs (pk: /pk with composite ${productId}:${sessionId}, TTL: 3 days)
    • debug_screenshots metadata (pk: /sessionId) — actual images stored in Azure Blob
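
The composite partition key and the TTLs listed above could be expressed roughly as follows. This is a minimal sketch: `buildCompositePk` and `CONTAINER_TTL_SECONDS` are hypothetical names, not taken from cosmos-init.ts, and the rule that `productId` must not contain `:` is an assumption added so the key can be split unambiguously.

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60;

// TTLs in seconds (Cosmos DB expresses container TTL in seconds).
const CONTAINER_TTL_SECONDS = {
  debug_sessions: 7 * SECONDS_PER_DAY,
  debug_traces: 7 * SECONDS_PER_DAY,
  debug_logs: 3 * SECONDS_PER_DAY,
} as const;

// Build the value stored in the `pk` field of trace and log documents.
function buildCompositePk(productId: string, sessionId: string): string {
  if (productId.length === 0 || sessionId.length === 0) {
    throw new Error("productId and sessionId are required");
  }
  // Assumption: productId never contains ':' so the key stays unambiguous.
  if (productId.indexOf(":") !== -1) {
    throw new Error("productId must not contain ':'");
  }
  return `${productId}:${sessionId}`;
}
```

Because every trace and log for one session shares the same `pk` value, the per-session queries can stay within a single logical partition.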

1.2 Repository Layer

  • 1.2.1 Create modules/diagnostics/repository.ts (f272a44)
    • Use @bytelyst/datastore getCollection() pattern (see telemetry/repository.ts)
    • createSession() — initiate debug session, emit diagnostics.session.created event
    • getSession() — fetch session by ID (point read; id is also the partition key)
    • getSessionForIngest() — optimized lookup for client ingest (query by sessionId field)
    • updateSession() — status changes, emit diagnostics.session.updated event
    • listSessions() — query by productId field with pagination
    • deleteSession() — manual cleanup, emit diagnostics.session.deleted event
    • ingestTrace() — batch upsert traces (use upsert() for idempotency)
    • ingestLogs() — batch upsert logs with PII scan (reuse telemetry PII patterns)
    • getTraces() — query by composite pk prefix ${productId}:${sessionId}
    • getLogs() — query by composite pk with level filters
    • updateSessionStats() — denormalize logCount/traceCount atomically
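
To illustrate why upsert-based ingestion keeps the denormalized counters stable under client retries, here is a small in-memory sketch. `InMemoryLogStore` and `applyLogCount` are hypothetical stand-ins for the real repository and the `@bytelyst/datastore` collection; the atomicity that `updateSessionStats()` calls for is not modelled here.

```typescript
// Minimal stand-in for a document in the debug_logs container.
interface LogDoc {
  id: string;
  pk: string; // `${productId}:${sessionId}`
  message: string;
}

class InMemoryLogStore {
  private readonly docs = new Map<string, LogDoc>();

  // Upsert a batch and return how many documents were new.
  // Re-sending the same batch (for example a client retry) returns 0.
  upsertBatch(batch: LogDoc[]): number {
    let inserted = 0;
    for (const doc of batch) {
      const key = `${doc.pk}|${doc.id}`;
      if (!this.docs.has(key)) inserted += 1;
      this.docs.set(key, doc);
    }
    return inserted;
  }
}

// Denormalized counter, incremented only by newly inserted documents.
function applyLogCount(currentLogCount: number, newlyInserted: number): number {
  return currentLogCount + newlyInserted;
}
```

Counting only newly inserted documents means a retried batch leaves `logCount` unchanged, which is the property the idempotent `upsert()` is meant to provide.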

1.3 REST API Routes

  • 1.3.1 Create modules/diagnostics/routes.ts (a66a689)
    • Apply requireRole('admin') for all session management routes
    • Apply rate limiting: 10 session creates per admin per hour (prevent abuse)
    • POST /diagnostics/sessions — create session (admin only)
    • GET /diagnostics/sessions — list sessions (admin only)
    • GET /diagnostics/sessions/:id — get session details (admin or session owner)
    • PATCH /diagnostics/sessions/:id — update session (admin only)
    • DELETE /diagnostics/sessions/:id — cancel session (admin only)
    • GET /diagnostics/config — client polling endpoint (any authenticated user)
    • POST /diagnostics/ingest — batch trace/log ingestion (any authenticated user)
    • POST /diagnostics/sessions/:id/traces — ingest trace spans
    • POST /diagnostics/sessions/:id/logs — ingest log entries
    • POST /diagnostics/sessions/:id/screenshots — get SAS URL for screenshot upload
    • GET /diagnostics/sessions/:id/traces — query traces for session
    • GET /diagnostics/sessions/:id/logs — query logs with filters
    • GET /diagnostics/sessions/:id/screenshots — list screenshot metadata
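
The "admin or session owner" rule on GET /diagnostics/sessions/:id could be checked along these lines. This is only a sketch: `Caller`, `SessionTarget`, and `canViewSession` are hypothetical names, and "owner" is assumed to mean the user named in `targetUserId`, following the access-control note in Appendix D.

```typescript
interface Caller {
  userId: string;
  roles: string[];
}

interface SessionTarget {
  targetUserId?: string;
}

// Admins may view any session; other users only sessions that target them.
function canViewSession(caller: Caller, session: SessionTarget): boolean {
  if (caller.roles.indexOf("admin") !== -1) return true;
  return session.targetUserId !== undefined && session.targetUserId === caller.userId;
}
```

Sessions that target only an anonymous ID or a device have no `targetUserId`, so under this assumption they are visible to admins only.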

1.4 Testing

  • 1.4.1 Create modules/diagnostics/diagnostics.test.ts (fb71981)
    • Session CRUD tests (6 tests implemented, 4 pending)
    • Trace ingestion tests (2 tests implemented, 6 pending)
    • Log ingestion tests (3 tests implemented, 5 pending)
    • Schema validation tests (5 tests)
    • Config polling tests (6 tests) — PENDING Phase 1.5
    • Screenshot tests (6 tests) — PENDING Phase 1.5
    • Target: 14+ tests implemented (38 target for full Phase 1)

1.5 Integration

  • 1.5.1 Wire into server.ts (d444a8d)
    • Import diagnosticsRoutes from ./modules/diagnostics/routes.js
    • Import registerDiagnosticsSubscribers from ./modules/diagnostics/subscribers.js
    • Register: await app.register(diagnosticsRoutes, { prefix: '/api' })
    • Register: registerDiagnosticsSubscribers(app.log) at startup
    • Add after telemetry routes (logical grouping)
  • 1.5.2 Event Bus Integration (lib/event-bus.ts) — 30583a1
    • Subscribers registered for all 8 diagnostics events
    • Email templates added (session-created, cancelled, completed, fatal-alert)
    • Send notification to target user (email/push) — pending user lookup
    • Notify admin who started session — pending admin lookup
    • Alert on-call engineer (PagerDuty/Slack) — future integration
  • 1.5.3 Audit Logging (modules/audit/) — 30583a1
    • Log all session lifecycle events (create, started, updated, cancel, completed, expired)
    • Log fatal log ingest and screenshot capture
    • Include target user ID, admin ID, session config in audit trail
    • Retention: 90 days via audit_log container TTL
  • 1.5.4 Rate Limiting Registration — 30583a1
    • Add diagnostics:session:create rate limit key (10/hour per admin)
    • Add diagnostics:config:poll rate limit key (1/5sec per device)
    • Add diagnostics:ingest:submit rate limit key (100/min per device)
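
The three rate limit keys above can be pictured with a simple fixed-window limiter. This is a sketch under assumptions: the platform's real rate limiting mechanism is not shown in this roadmap, so `FixedWindowLimiter` and `DIAGNOSTICS_LIMITS` are hypothetical, and a fixed window is just one of several possible algorithms.

```typescript
interface RateLimitRule {
  limit: number;
  windowMs: number;
}

// Limits mirror the three keys listed above.
const DIAGNOSTICS_LIMITS: Record<string, RateLimitRule> = {
  "diagnostics:session:create": { limit: 10, windowMs: 60 * 60 * 1000 },
  "diagnostics:config:poll": { limit: 1, windowMs: 5 * 1000 },
  "diagnostics:ingest:submit": { limit: 100, windowMs: 60 * 1000 },
};

class FixedWindowLimiter {
  private readonly windows = new Map<string, { windowStart: number; count: number }>();

  // Returns true if the call is allowed; `subject` is the admin or device ID.
  allow(key: string, subject: string, nowMs: number): boolean {
    const rule = DIAGNOSTICS_LIMITS[key];
    if (rule === undefined) throw new Error(`unknown rate limit key: ${key}`);
    const bucketId = `${key}|${subject}`;
    const current = this.windows.get(bucketId);
    // Start a fresh window on first use or once the previous one has elapsed.
    if (current === undefined || nowMs - current.windowStart >= rule.windowMs) {
      this.windows.set(bucketId, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (current.count >= rule.limit) return false;
    current.count += 1;
    return true;
  }
}
```

Passing `nowMs` in explicitly keeps the limiter deterministic in tests; a production version would more likely live in a shared store so that limits hold across service instances.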

Phase 1 Exit Criteria:

  • All routes return 200 with correct payloads
  • 17 tests passing (diagnostics module) / 839 total platform-service tests
  • Event bus subscribers registered and tested
  • Audit logs written for all session operations
  • Rate limiting enforced
  • PII redaction working in log ingest
  • Admin can create session via API
  • 38+ tests target (deferred: config polling, screenshot tests — Phase 2)

Phase 2: Client SDK Abstractions (Week 1-2)

2.1 TypeScript Client SDK — COMPLETE 8acb8db

  • 2.1.1 Create @bytelyst/diagnostics-client package
    • package.json with ESM exports
    • tsconfig.json extending base
  • 2.1.2 Core types (src/types.ts)
    • DiagnosticsSession interface
    • TraceSpan interface
    • LogLevel type (debug, info, warn, error, fatal)
    • DiagnosticsConfig from server
  • 2.1.3 Main client (src/client.ts)
    • DiagnosticsClient class (singleton)
    • start() — begin polling for active sessions
    • stop() — cease polling
    • isSessionActive() — check current state
    • trace(name, fn) — auto-instrumented span wrapper
    • log(level, message, context) — structured logging
    • breadcrumb(category, message, data) — add timeline marker
  • 2.1.4 Network interceptor (src/network.ts)
    • NetworkInterceptor class
    • Wrap fetch() for capture
    • Capture: URL, method, headers (sanitized), timing, status
    • Configurable URL patterns (include/exclude)
  • 2.1.5 Breadcrumb trail (src/breadcrumbs.ts)
    • Ring buffer (max 100 entries)
    • Manual: breadcrumb() API
  • 2.1.6 Device state (src/device.ts)
    • Memory, battery, storage, network type
  • 2.1.7 Screenshot capture — deferred to Phase 2.2+
  • 2.1.8 Tests (src/__tests__/client.test.ts)
    • 21 Vitest tests passing
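
The breadcrumb ring buffer from 2.1.5 can be sketched as below. `BreadcrumbBuffer` and the `Breadcrumb` shape are hypothetical and are not copied from src/breadcrumbs.ts; only the default capacity of 100 comes from the list above.

```typescript
interface Breadcrumb {
  category: string;
  message: string;
  timestamp: string;
  data?: Record<string, unknown>;
}

class BreadcrumbBuffer {
  private readonly slots: Breadcrumb[] = [];
  private readonly capacity: number;
  private next = 0; // index of the oldest entry once the buffer is full

  constructor(capacity: number = 100) {
    if (capacity <= 0) throw new Error("capacity must be positive");
    this.capacity = capacity;
  }

  // Append a breadcrumb, overwriting the oldest one when full.
  add(crumb: Breadcrumb): void {
    if (this.slots.length < this.capacity) {
      this.slots.push(crumb);
    } else {
      this.slots[this.next] = crumb;
      this.next = (this.next + 1) % this.capacity;
    }
  }

  // Snapshot ordered from oldest to newest.
  toArray(): Breadcrumb[] {
    return this.slots.slice(this.next).concat(this.slots.slice(0, this.next));
  }
}
```

A fixed-size buffer bounds memory on the device while still keeping the most recent part of the user journey available when a session becomes active.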

2.2 Swift Client SDK (iOS) — COMPLETE abcf817

  • 2.2.1 Create ByteLystDiagnostics Swift package
    • Package.swift with iOS 15+ target
    • Module structure: Core, Network, Device
  • 2.2.2 Core client (Sources/Core/DiagnosticsClient.swift)
    • DiagnosticsClient actor (thread-safe)
    • start() — polling with Timer
    • trace<T>(name, operation) — async span wrapper
    • log(level, message, metadata) — structured logging
    • breadcrumb(category, message) — timeline
  • 2.2.3 Network interception (Sources/Network/NetworkInterceptor.swift)
    • URLProtocol subclass for automatic capture
    • Capture: request/response, timing, bytes
  • 2.2.4 Device state (Sources/Device/DeviceState.swift)
    • UIDevice integration (battery, thermal)
    • ProcessInfo (memory pressure)
    • NetworkMonitor (path status)
  • 2.2.5 Screenshot — deferred to Phase 4
  • 2.2.6 Tests (Tests/DiagnosticsClientTests.swift)
    • 20+ XCTest unit tests

2.3 Kotlin Client SDK (Android) — COMPLETE fc8f8d3

  • 2.3.1 Create diagnostics module in kotlin-platform-sdk
    • Module structure with Coroutines + OkHttp
  • 2.3.2 Core client (diagnostics/DiagnosticsClient.kt)
    • Singleton with StateFlow<DiagnosticsState>
    • start() — polling with coroutines
    • trace() — suspend function with span
    • log() — structured queue
  • 2.3.3 Network interceptor (diagnostics/NetworkInterceptor.kt)
    • Interceptor implementation
    • Capture: request/response chain
  • 2.3.4 Device state (diagnostics/DeviceStateCollector.kt)
    • BatteryManager, ActivityManager, StorageStatsManager
  • 2.3.5 Screenshot — deferred to Phase 4
  • 2.3.6 Tests (diagnostics/DiagnosticsTypesTest.kt)
    • 16+ JUnit tests

Phase 2 Exit Criteria:

  • TS SDK builds + 20 tests passing
  • Swift SDK builds + 20 tests passing
  • Kotlin SDK builds + 16 tests passing
  • All SDKs can poll config endpoint

Phase 3: Admin Dashboard UI (Week 2)

3.1 Debug Sessions Page — COMPLETE 2e697a1

  • 3.1.1 Create /ops/debug-sessions/page.tsx
    • Session list table (columns: ID, user, device, status, started, duration)
    • Filters: status, product, date range
    • Auto-refresh every 5 seconds
    • "New Session" button → modal
  • 3.1.2 New Session Modal
    • Target user (email/userId)
    • Target device (input)
    • Collection level (standard, debug, trace)
    • Duration slider (5min → 24hr)
    • Screenshot on error (toggle)
    • "Start Session" → POST API

3.2 Session Detail View — COMPLETE e2e5e2c

  • 3.2.1 Create /ops/debug-sessions/[id]/page.tsx
    • Session header (status badge, user info, device info)
    • Action buttons: Extend (+30min), Stop, Download
    • Tabs: Timeline, Logs, Network, Traces, Screenshots
  • 3.2.2 Timeline Tab
    • Breadcrumb list (time, category, message)
    • Visual timeline with connector lines
  • 3.2.3 Logs Tab
    • Log level filters (debug, info, warn, error, fatal)
    • Color-coded log levels
    • Module and timestamp display
  • 3.2.4 Network Tab
    • Request list (time, method, URL, status, duration)
    • Status badge coloring
  • 3.2.5 Traces Tab
    • Trace list with name, status, duration
    • Status badge coloring
  • 3.2.6 Screenshots Tab
    • Placeholder for screenshot grid
    • Empty state messaging

3.3 Real-time Updates

  • 3.3.1 Server-sent events or polling
    • Auto-refresh session status every 5 seconds
    • Toast notification on new logs/traces

3.4 Client Library — COMPLETE

  • 3.4.1 Create lib/diagnostics-client.ts
    • querySessions()
    • createSession()
    • getSession()
    • updateSession()
    • getTraces()
    • getLogs()

Phase 3 Exit Criteria:

  • Admin can create session from UI
  • Session detail shows live data
  • All 5 tabs functional (Timeline, Logs, Network, Traces, Screenshots)

Phase 4: Advanced Features (Week 3)

4.1 Automated Triggers

  • 4.1.1 Error-threshold triggers
    • Config: "Start debug session if error rate > X%"
    • Background job checks every 5 minutes
    • Auto-notify on Slack/Teams
  • 4.1.2 Crash-triggered sessions
    • Client sends crash → server auto-starts session
    • Captures 60 seconds pre-crash context

4.2 Session Replay (Future)

  • 4.2.1 DOM/View state capture
    • Record user interactions (clicks, scrolls, inputs)
    • Replay as video-like timeline
    • Privacy: exclude password fields

4.3 Performance Profiling

  • 4.3.1 CPU/Memory profiling
    • iOS: os_signpost integration
    • Android: Debug.MemoryInfo
    • Web: performance.now() + memory API

4.4 Integration Tests

  • 4.4.1 E2E test: Admin creates session → Client captures → Admin views
    • Playwright test (web client)
    • XCTest UI test (iOS)
    • Espresso test (Android)

Phase 4 Exit Criteria:

  • Auto-trigger tests passing
  • E2E flow working end-to-end

Appendix A: Data Models

DebugSessionDoc

interface DebugSessionDoc {
  id: string; // ds_<uuid> — also the partition key (/id)
  productId: string; // For filtering/querying (not pk to avoid hot partitions)

  // Target (at least one required)
  targetUserId?: string; // For authenticated users
  targetAnonymousId?: string; // For anonymous users (installId)
  targetDeviceId?: string; // Specific device fingerprint
  targetSessionId?: string; // Specific app session to capture

  // Status lifecycle: pending → active → paused → completed | cancelled
  status: 'pending' | 'active' | 'paused' | 'completed' | 'cancelled';

  // Collection configuration
  collectionLevel: 'standard' | 'debug' | 'trace';
  captureLogs: boolean;
  captureNetwork: boolean;
  captureScreenshots: boolean;
  screenshotOnError: boolean;
  maxDurationMinutes: number; // Default: 60, Max: 1440 (24h)

  // Timestamps
  createdAt: string; // ISO 8601
  updatedAt: string; // Last status/config change
  startedAt?: string; // When status became 'active'
  endedAt?: string; // When status became 'completed'|'cancelled'
  expiresAt: string; // Auto-expiry (createdAt + maxDurationMinutes)

  // Stats (denormalized for fast reads, updated via ingest)
  logCount: number;
  traceCount: number;
  screenshotCount: number;

  // Audit
  createdBy: string; // Admin userId who created session
  updatedBy?: string; // Last admin to modify

  // Consent tracking (privacy compliance)
  userConsent?: {
    consentedAt: string;
    consentMethod: 'prompt' | 'pre_consent' | 'auto'; // How user agreed
  };
}
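
The target and duration rules in this model can be made concrete with a small helper. This is a sketch: `CreateSessionInput` and `resolveSessionTiming` are hypothetical (the real module validates input with Zod schemas), and the lower bound of 1 minute is an assumption.

```typescript
interface CreateSessionInput {
  targetUserId?: string;
  targetAnonymousId?: string;
  targetDeviceId?: string;
  targetSessionId?: string;
  maxDurationMinutes?: number;
}

const DEFAULT_DURATION_MINUTES = 60;
const MAX_DURATION_MINUTES = 1440; // 24h

// Validate the input and derive expiresAt = createdAt + maxDurationMinutes.
function resolveSessionTiming(
  input: CreateSessionInput,
  createdAt: Date,
): { maxDurationMinutes: number; expiresAt: string } {
  const targets = [
    input.targetUserId,
    input.targetAnonymousId,
    input.targetDeviceId,
    input.targetSessionId,
  ];
  const hasTarget = targets.some((t) => typeof t === "string" && t.length > 0);
  if (!hasTarget) throw new Error("at least one target is required");

  const minutes = input.maxDurationMinutes ?? DEFAULT_DURATION_MINUTES;
  if (!Number.isInteger(minutes) || minutes < 1 || minutes > MAX_DURATION_MINUTES) {
    throw new Error(`maxDurationMinutes must be an integer in 1..${MAX_DURATION_MINUTES}`);
  }
  const expiresAt = new Date(createdAt.getTime() + minutes * 60 * 1000).toISOString();
  return { maxDurationMinutes: minutes, expiresAt };
}
```

For example, a session created at 2026-03-03T00:00:00.000Z with no explicit duration would expire at 2026-03-03T01:00:00.000Z.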

DebugTraceDoc (OpenTelemetry-compatible)

interface DebugTraceDoc {
  id: string; // tr_<uuid>
  pk: string; // Composite: `${productId}:${sessionId}` — partition key
  sessionId: string; // For queries (also part of pk)
  productId: string; // For filtering

  // OpenTelemetry trace context
  traceId: string; // OTel trace ID (hex)
  parentId?: string; // Parent span ID (omitted for root spans)
  spanId: string; // This span's ID
  name: string; // Operation name (e.g., "UserLogin", "API.fetchUser")
  kind?: 'internal' | 'server' | 'client' | 'producer' | 'consumer';

  // Timing (ISO 8601 timestamps; duration in milliseconds)
  startTime: string; // ISO 8601
  endTime?: string;
  durationMs?: number;

  // Context and attributes
  attributes: Record<string, unknown>; // Custom key-value pairs
  status: 'ok' | 'error' | 'unset';
  statusMessage?: string; // Error description if status=error

  // Events (timestamped points within the span, e.g., "db.query", "cache.hit")
  events?: Array<{
    name: string;
    timestamp: string;
    attributes?: Record<string, unknown>;
  }>;

  // Links to other traces (for async operations)
  links?: Array<{
    traceId: string;
    spanId: string;
    attributes?: Record<string, unknown>;
  }>;
}
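
A span wrapper in the spirit of the SDK's `trace(name, fn)` could populate the timing and status fields of this model as follows. This is a synchronous sketch with an injectable clock: `SpanRecorder` and `SpanRecord` are hypothetical, the `span_N` IDs are placeholders rather than real OTel span IDs, and an async variant would await `fn()` in the same try/finally structure.

```typescript
interface SpanRecord {
  spanId: string;
  name: string;
  startTime: string;
  endTime: string;
  durationMs: number;
  status: "ok" | "error";
  statusMessage?: string;
}

class SpanRecorder {
  readonly spans: SpanRecord[] = [];
  private counter = 0;
  private readonly nowMs: () => number;

  constructor(nowMs: () => number = () => Date.now()) {
    this.nowMs = nowMs;
  }

  // Run fn, record one span for it, and rethrow any error unchanged.
  trace<T>(name: string, fn: () => T): T {
    const startMs = this.nowMs();
    let status: "ok" | "error" = "ok";
    let statusMessage: string | undefined;
    try {
      return fn();
    } catch (err) {
      status = "error";
      statusMessage = err instanceof Error ? err.message : String(err);
      throw err;
    } finally {
      const endMs = this.nowMs();
      this.counter += 1;
      this.spans.push({
        spanId: `span_${this.counter}`, // placeholder ID for illustration
        name,
        startTime: new Date(startMs).toISOString(),
        endTime: new Date(endMs).toISOString(),
        durationMs: endMs - startMs,
        status,
        statusMessage,
      });
    }
  }
}
```

Recording the span in `finally` ensures failed operations are captured with `status: 'error'` while the caller still sees the original exception.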

DebugLogEntryDoc

interface DebugLogEntryDoc {
  id: string;                     // log_<uuid>
  pk: string;                     // Composite: `${productId}:${sessionId}` — partition key
  sessionId: string;              // For queries (also part of pk)
  productId: string;              // For filtering

  // Log level (matches syslog/OTel severity)
  level: 'debug' | 'info' | 'warn' | 'error' | 'fatal';
  message: string;                // Original message (PII redacted server-side)
  messageHash?: string;           // SHA-256 for deduplication

  // Timestamp (client clock, server enriches with receivedAt)
  timestamp: string;                // ISO 8601 — when log was generated
  receivedAt?: string;            // Server-side ingestion time

  // Source context
  module: string;                   // Component/module name (e.g., "AudioEngine", "SyncManager")
  file?: string;                    // Source file path (sanitized)
  line?: number;                    // Line number
  function?: string;                // Function/method name

  // Thread/task context
  threadId?: string;                // For multi-threaded apps
  correlationId?: string;           // Links related operations

  // Arbitrary context (PII scanned and redacted)
  context: Record<string, unknown>;

  // PII redaction metadata
  redaction?: {
    fieldsRedacted: string[];     // Which fields were scrubbed
    patternsMatched: string[];    // Which PII patterns found (email, ssn, etc.)
  };
}

DebugScreenshotDoc (Metadata only — image in Blob)

interface DebugScreenshotDoc {
  id: string; // scr_<uuid>
  sessionId: string; // Partition key for queries
  productId: string;

  // Storage reference (actual image in Azure Blob)
  blobUrl: string; // SAS URL to blob (time-limited)
  blobPath: string; // Path in container: `screenshots/${productId}/${sessionId}/${id}.png`
  containerName: string; // Azure Blob container (e.g., "diagnostics-screenshots")

  // Screenshot metadata
  capturedAt: string; // When captured
  trigger: 'manual' | 'error' | 'interval' | 'user_request'; // Why taken

  // Dimensions & format
  width: number;
  height: number;
  format: 'png' | 'jpeg' | 'webp';
  sizeBytes: number;

  // Privacy
  sensitiveViewsBlurred: boolean; // Whether PII areas were blurred
  blurRegions?: Array<{ x: number; y: number; w: number; h: number }>; // If partial blur

  // Optional context
  screenName?: string; // Current screen/view when captured
  breadcrumbAtCapture?: string; // Last breadcrumb before screenshot
}

Appendix B: API Reference

| Method | Endpoint | Auth | Rate Limit | Description |
| --- | --- | --- | --- | --- |
| POST | /api/diagnostics/sessions | Admin | 10/hour/admin | Create debug session |
| GET | /api/diagnostics/sessions | Admin | 100/min | List sessions (paginated) |
| GET | /api/diagnostics/sessions/:id | Admin/Owner | 100/min | Get session details |
| PATCH | /api/diagnostics/sessions/:id | Admin | 10/min | Update session status |
| DELETE | /api/diagnostics/sessions/:id | Admin | 10/min | Cancel session (soft delete) |
| GET | /api/diagnostics/config | Any auth | 1/5sec/device | Poll for active session (ETag cached) |
| POST | /api/diagnostics/ingest | Any auth | 100/min/device | Submit traces/logs batch (max 50 items) |
| POST | /api/diagnostics/sessions/:id/traces | Any auth | 100/min/device | Ingest trace spans |
| POST | /api/diagnostics/sessions/:id/logs | Any auth | 100/min/device | Ingest log entries |
| POST | /api/diagnostics/sessions/:id/screenshots | Any auth | 10/min/device | Get SAS URL for screenshot upload |
| GET | /api/diagnostics/sessions/:id/traces | Admin | 100/min | Query traces for session |
| GET | /api/diagnostics/sessions/:id/logs | Admin | 100/min | Query logs with filters |
| GET | /api/diagnostics/sessions/:id/screenshots | Admin | 100/min | List screenshot metadata |
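
The ETag-cached polling of /api/diagnostics/config can be modelled as a small state transition on the client. This is a sketch under assumptions: the response body shape (`activeSessionId`) is hypothetical, as are `PollState`, `ConfigReply`, `buildPollHeaders`, and `applyConfigReply`; the HTTP call itself is left to the caller.

```typescript
interface PollState {
  etag?: string;
  activeSessionId: string | null;
}

// Hypothetical summary of one HTTP response from the config endpoint.
interface ConfigReply {
  statusCode: number;
  etag?: string;
  activeSessionId?: string | null;
}

// Send the last seen ETag so the server can answer 304 Not Modified.
function buildPollHeaders(state: PollState): Record<string, string> {
  const headers: Record<string, string> = {};
  if (state.etag !== undefined) headers["If-None-Match"] = state.etag;
  return headers;
}

// Fold one response into the client state.
function applyConfigReply(state: PollState, reply: ConfigReply): PollState {
  if (reply.statusCode === 304) return state; // unchanged, keep cached config
  if (reply.statusCode === 200) {
    return { etag: reply.etag, activeSessionId: reply.activeSessionId ?? null };
  }
  throw new Error(`unexpected status ${reply.statusCode}`);
}
```

With a 5-second poll interval per device, most responses should be 304s with no body, which keeps the cost of idle polling low.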

Appendix C: Industry Comparison

Capability Firebase Crashlytics Sentry Datadog RUM Our Solution
Crash Reporting (via telemetry)
Error Tracking (via telemetry)
Breadcrumbs
Custom Traces ⚠️ Limited
Network Tracing
Console Logs ⚠️ Error only (all levels)
Session Replay 🟡 Future
Remote Trigger
On-Device Debug
Screenshots ⚠️ Crash only
Open Source (SDK)

Appendix D: Privacy & Security

D.1 PII Redaction Patterns (server-side)

| Pattern | Regex | Redaction Method | Example |
| --- | --- | --- | --- |
| Email | `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}` | Replace with `[EMAIL]` | `user@example.com` → `[EMAIL]` |
| SSN (US) | `\b\d{3}-?\d{2}-?\d{4}\b` | Replace with `[SSN]` | `123-45-6789` → `[SSN]` |
| Credit Card | `\b(?:\d[ -]*?){13,16}\b` | Replace with `[CC]` | `4111 1111 1111 1111` → `[CC]` |
| Phone | `\b\+?1?\s?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b` | Replace with `[PHONE]` | `+1 (555) 123-4567` → `[PHONE]` |
| IP Address | `\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b` | Replace with `[IP]` | `192.168.1.1` → `[IP]` |
| Password/Token | `(?i)(password\|token\|secret\|key)\s*[:=]\s*\S+` | Replace with `[CREDENTIAL]` | `password: secret123` → `password: [CREDENTIAL]` |
| JWT | `eyJ[a-zA-Z0-9_-]*\.eyJ[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]*` | Replace with `[JWT]` | Full JWT → `[JWT]` |

  • 1. PII Redaction: Implement all patterns above in lib/pii-redaction.ts (shared with telemetry)
  • 2. Redaction Metadata: Store which patterns matched in redaction.patternsMatched and which fields were scrubbed in redaction.fieldsRedacted for transparency
  • 3. Consent Tracking: userConsent field in session doc with consentedAt and consentMethod
  • 4. Data Retention: 7-day default TTL for sessions/traces, 3-day for logs (Cosmos TTL)
  • 5. Access Control: Admin-only session creation; users can only view their own sessions via targetUserId check
  • 6. Encryption: All data encrypted at rest (Cosmos), in transit (TLS 1.3)
  • 7. Audit Trail: All session operations logged to audit_log container (90-day retention)
  • 8. User Notification: Email/push notification when debug session started on their device
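
A reduced version of the redaction pass might look like this. It is a sketch, not the contents of lib/pii-redaction.ts: only four of the patterns from the table are included, `redactMessage` and `PII_PATTERNS` are hypothetical names, and running the JWT pattern first is an assumption about ordering.

```typescript
interface PiiPattern {
  name: string;
  regex: RegExp;
  replacement: string;
}

// Subset of the patterns from the table above, as JavaScript regex literals.
const PII_PATTERNS: PiiPattern[] = [
  { name: "jwt", regex: /eyJ[a-zA-Z0-9_-]*\.eyJ[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]*/g, replacement: "[JWT]" },
  { name: "email", regex: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g, replacement: "[EMAIL]" },
  { name: "ssn", regex: /\b\d{3}-?\d{2}-?\d{4}\b/g, replacement: "[SSN]" },
  { name: "ip", regex: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/g, replacement: "[IP]" },
];

// Redact a message and report which patterns matched, for redaction metadata.
function redactMessage(message: string): { message: string; patternsMatched: string[] } {
  let redacted = message;
  const patternsMatched: string[] = [];
  for (const pattern of PII_PATTERNS) {
    const next = redacted.replace(pattern.regex, pattern.replacement);
    if (next !== redacted) patternsMatched.push(pattern.name);
    redacted = next;
  }
  return { message: redacted, patternsMatched };
}
```

Note that JavaScript regular expressions do not accept the inline `(?i)` flag used in the Password/Token row; a port of that pattern would use the `i` flag on the regex literal instead.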

Appendix E: Event Bus Events

| Event Name | Payload | Publishers | Subscribers |
| --- | --- | --- | --- |
| diagnostics.session.created | { sessionId, productId, targetUserId, createdBy } | diagnostics module | notifications → email/push user |
| diagnostics.session.started | { sessionId, productId, startedAt } | diagnostics module | |
| diagnostics.session.updated | { sessionId, productId, changes, updatedBy } | diagnostics module | audit log |
| diagnostics.session.cancelled | { sessionId, productId, reason, cancelledBy } | diagnostics module | notifications → admin |
| diagnostics.session.completed | { sessionId, productId, stats, endedAt } | diagnostics module | notifications → admin summary email |
| diagnostics.session.expired | { sessionId, productId, expiredAt } | diagnostics module (TTL job) | |
| diagnostics.ingest.fatal | { sessionId, productId, logEntry, timestamp } | diagnostics module (on fatal log) | PagerDuty/Slack alert |
| diagnostics.screenshot.captured | { sessionId, productId, screenshotId, trigger } | diagnostics module | |
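
The event names and payloads above lend themselves to a typed publish/subscribe map. The following sketch covers two of the eight events; `TinyEventBus` and `DiagnosticsEvents` are hypothetical and do not reflect the actual API of lib/event-bus.ts.

```typescript
// Payload types for two of the events listed above.
interface DiagnosticsEvents {
  "diagnostics.session.created": {
    sessionId: string;
    productId: string;
    targetUserId?: string;
    createdBy: string;
  };
  "diagnostics.session.cancelled": {
    sessionId: string;
    productId: string;
    reason: string;
    cancelledBy: string;
  };
}

type UntypedHandler = (payload: unknown) => void;

class TinyEventBus {
  private readonly handlers = new Map<string, UntypedHandler[]>();

  // The event name determines the payload type the handler receives.
  subscribe<K extends keyof DiagnosticsEvents>(
    eventName: K,
    handler: (payload: DiagnosticsEvents[K]) => void,
  ): void {
    const list = this.handlers.get(eventName) ?? [];
    list.push(handler as UntypedHandler);
    this.handlers.set(eventName, list);
  }

  // Deliver the payload to every subscriber; returns how many were called.
  publish<K extends keyof DiagnosticsEvents>(eventName: K, payload: DiagnosticsEvents[K]): number {
    const list = this.handlers.get(eventName) ?? [];
    list.forEach((h) => h(payload));
    return list.length;
  }
}
```

Tying payload types to event names lets the compiler reject a subscriber that reads a field the publisher never sends.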

Current Status

  • Design complete — 2026-03-02
  • Review complete — 10 bugs/gaps identified and fixed
  • Phase 1: Server Foundation — COMPLETE — 2026-03-03
    • 17 diagnostics tests passing, 839 total platform-service tests
    • Event bus subscribers registered for 8 events
    • Audit logging for all session lifecycle events
    • Rate limiting keys configured
    • 4 email templates ready for notifications
  • Phase 2: Client SDKs — COMPLETE — 2026-03-03
    • TypeScript SDK: 21 tests passing
    • Swift SDK: 20+ tests, iOS 15+ support
    • Kotlin SDK: 16+ tests, API 26+ support
  • Phase 3: Admin UI — COMPLETE — 2026-03-03
    • Debug Sessions list page (3.1) with filters
    • Session Detail view (3.2) with 5 tabs
    • Real-time auto-refresh (5s polling)
  • Phase 4: Advanced Features — Future

Total Tasks: 140+ checkboxes across 4 phases

Last Updated: 2026-03-03