Remote Diagnostics & Debug Tracing — Implementation Roadmap
Module: platform-service/src/modules/diagnostics/
Client SDK: @bytelyst/diagnostics
Target: E2E debug collection from any device, on-demand triggers, industry-parity features
Estimated Effort: 2–3 weeks
Status: 🟡 In progress (Phase 1 complete, Phase 2 next)
Executive Summary
This roadmap delivers a Datadog/Sentry-grade remote diagnostics system for the ByteLyst ecosystem. Unlike the passive telemetry we already have, it enables active debugging sessions in which engineers remotely trigger deep diagnostics collection from any user device.
Key Differentiators vs. Existing Telemetry
| Feature | Existing Telemetry | Remote Diagnostics |
|---|---|---|
| Trigger | Passive (always sampling) | Active (engineer-initiated) |
| Log Level | Static config | Dynamic (debug/trace per session) |
| Network Tracing | None | Full HTTP capture |
| Breadcrumbs | Basic events | Rich timeline (user journey) |
| Console Logs | Error-only | Full capture (debug→fatal) |
| Screenshots | None | Auto-capture on crash |
| Session Replay | None | Future: video-style replay |
Phase 1: Server Foundation (Week 1)
1.1 Data Model & Schemas
- 1.1.1 Create modules/diagnostics/types.ts (f51c352)
  - DebugSessionDoc: session metadata (status, target, config)
  - DebugTraceDoc: trace spans with timing
  - DebugLogEntryDoc: structured log entries
  - DebugScreenshotDoc: metadata for blob storage
  - Zod schemas for all inputs
- 1.1.2 Add Cosmos containers to cosmos-init.ts (dea1521)
  - debug_sessions (pk: /id, TTL: 7 days)
  - debug_traces (pk: /pk with composite ${productId}:${sessionId}, TTL: 7 days)
  - debug_logs (pk: /pk with composite ${productId}:${sessionId}, TTL: 3 days)
  - debug_screenshots metadata (pk: /sessionId); actual images stored in Azure Blob
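The container layout above can be sketched as a small set of constants plus a key builder. The names `CONTAINER_TTL_SECONDS` and `compositePk` are hypothetical and do not come from cosmos-init.ts; only the container names, key shape, and TTL values are taken from the list. Cosmos DB expresses TTL in seconds, so the day counts are converted.

```typescript
// Hypothetical helpers illustrating the container layout listed above.
// Cosmos DB TTL values are expressed in seconds.
const DAY_SECONDS = 24 * 60 * 60;

const CONTAINER_TTL_SECONDS: Record<string, number> = {
  debug_sessions: 7 * DAY_SECONDS,
  debug_traces: 7 * DAY_SECONDS,
  debug_logs: 3 * DAY_SECONDS,
};

// Builds the composite partition key value stored in the `pk` field of
// debug_traces and debug_logs documents.
function compositePk(productId: string, sessionId: string): string {
  if (!productId || !sessionId) {
    throw new Error("productId and sessionId are both required");
  }
  return `${productId}:${sessionId}`;
}
```

Keeping every trace and log of one session under a single `pk` value means the per-session queries in the repository layer stay within one logical partition.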
1.2 Repository Layer
- 1.2.1 Create modules/diagnostics/repository.ts (f272a44)
  - Use the @bytelyst/datastore getCollection() pattern (see telemetry/repository.ts)
  - createSession(): initiate debug session, emit diagnostics.session.created event
  - getSession(): fetch session by ID (single-partition point read, since /id is the partition key)
  - getSessionForIngest(): optimized lookup for client ingest (query by sessionId field)
  - updateSession(): status changes, emit diagnostics.session.updated event
  - listSessions(): query by productId field with pagination
  - deleteSession(): manual cleanup, emit diagnostics.session.deleted event
  - ingestTrace(): batch upsert traces (use upsert() for idempotency)
  - ingestLogs(): batch upsert logs with PII scan (reuse telemetry PII patterns)
  - getTraces(): query by composite pk prefix ${productId}:${sessionId}
  - getLogs(): query by composite pk with level filters
  - updateSessionStats(): denormalize logCount/traceCount atomically
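To illustrate why upsert makes batch ingest idempotent, here is a minimal sketch that uses an in-memory map in place of a real collection. `InMemoryTraceStore` and `ingestTraceBatch` are hypothetical names; the @bytelyst/datastore API is not shown in this document, so nothing here should be read as its actual interface.

```typescript
// In-memory stand-in for a collection; the real repository uses
// @bytelyst/datastore, whose API is not shown in this document.
interface TraceLike {
  id: string;
  pk: string; // composite `${productId}:${sessionId}`
  name: string;
}

class InMemoryTraceStore {
  private docs = new Map<string, TraceLike>();

  // Upsert: insert when the (pk, id) pair is new, overwrite otherwise.
  // Returns true only when a new document was created.
  upsert(doc: TraceLike): boolean {
    const key = `${doc.pk}|${doc.id}`;
    const isNew = !this.docs.has(key);
    this.docs.set(key, doc);
    return isNew;
  }

  // Counts documents stored under one partition key value.
  count(pk: string): number {
    let n = 0;
    this.docs.forEach((doc) => {
      if (doc.pk === pk) n++;
    });
    return n;
  }
}

// Batch ingest that is safe to retry: replaying a batch creates no
// duplicates, so the returned number can feed a denormalized traceCount.
function ingestTraceBatch(store: InMemoryTraceStore, batch: TraceLike[]): number {
  let created = 0;
  for (const doc of batch) {
    if (store.upsert(doc)) created++;
  }
  return created;
}
```

If a client retries a batch after a timeout, the second call returns 0, so a stats update driven by the return value does not over-count.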
1.3 REST API Routes
- 1.3.1 Create modules/diagnostics/routes.ts (a66a689)
  - Apply requireRole('admin') for all session management routes
  - Apply rate limiting: 10 session creates per admin per hour (prevent abuse)
  - POST /diagnostics/sessions: create session (admin only)
  - GET /diagnostics/sessions: list sessions (admin only)
  - GET /diagnostics/sessions/:id: get session details (admin or session owner)
  - PATCH /diagnostics/sessions/:id: update session (admin only)
  - DELETE /diagnostics/sessions/:id: cancel session (admin only)
  - GET /diagnostics/config: client polling endpoint (any authenticated user)
  - POST /diagnostics/ingest: batch trace/log ingestion (any authenticated user)
  - POST /diagnostics/sessions/:id/traces: ingest trace spans
  - POST /diagnostics/sessions/:id/logs: ingest log entries
  - POST /diagnostics/sessions/:id/screenshots: get SAS URL for screenshot upload
  - GET /diagnostics/sessions/:id/traces: query traces for session
  - GET /diagnostics/sessions/:id/logs: query logs with filters
  - GET /diagnostics/sessions/:id/screenshots: list screenshot metadata
1.4 Testing
- 1.4.1 Create modules/diagnostics/diagnostics.test.ts (fb71981)
  - Session CRUD tests (6 tests implemented, 4 pending)
  - Trace ingestion tests (2 tests implemented, 6 pending)
  - Log ingestion tests (3 tests implemented, 5 pending)
  - Schema validation tests (5 tests)
  - Config polling tests (6 tests): PENDING Phase 1.5
  - Screenshot tests (6 tests): PENDING Phase 1.5
  - Target: 14+ tests implemented (38 target for full Phase 1)
1.5 Integration
- 1.5.1 Wire into server.ts (d444a8d)
  - Import diagnosticsRoutes from ./modules/diagnostics/routes.js
  - Import registerDiagnosticsSubscribers from ./modules/diagnostics/subscribers.js
  - Register: await app.register(diagnosticsRoutes, { prefix: '/api' })
  - Register: registerDiagnosticsSubscribers(app.log) at startup
  - Add after telemetry routes (logical grouping)
- 1.5.2 Event Bus Integration (lib/event-bus.ts) (30583a1)
  - Subscribers registered for all 8 diagnostics events
  - Email templates added (session-created, cancelled, completed, fatal-alert)
  - Send notification to target user (email/push): pending user lookup
  - Notify admin who started session: pending admin lookup
  - Alert on-call engineer (PagerDuty/Slack): future integration
- 1.5.3 Audit Logging (modules/audit/) (30583a1)
  - Log all session lifecycle events (created, started, updated, cancelled, completed, expired)
  - Log fatal log ingest and screenshot capture
  - Include target user ID, admin ID, session config in audit trail
  - Retention: 90 days via audit_log container TTL
- 1.5.4 Rate Limiting Registration (30583a1)
  - Add diagnostics:session:create rate limit key (10/hour per admin)
  - Add diagnostics:config:poll rate limit key (1/5sec per device)
  - Add diagnostics:ingest:submit rate limit key (100/min per device)
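The three rate limit keys can be expressed as data and enforced with a simple counter. The sketch below uses a fixed-window counter held in memory, which is an assumption made for illustration; the platform's actual limiter and its storage are not described here, and `FixedWindowLimiter` is a hypothetical name.

```typescript
// Limits taken from the list above; the limiter itself is an illustrative
// fixed-window counter, not the platform's actual rate limiter.
interface RateLimit {
  max: number;
  windowMs: number;
}

const DIAGNOSTICS_LIMITS: Record<string, RateLimit> = {
  "diagnostics:session:create": { max: 10, windowMs: 60 * 60 * 1000 },
  "diagnostics:config:poll": { max: 1, windowMs: 5 * 1000 },
  "diagnostics:ingest:submit": { max: 100, windowMs: 60 * 1000 },
};

class FixedWindowLimiter {
  private windows = new Map<string, { start: number; count: number }>();

  // Returns true when the call is allowed.
  // `subject` is the admin ID or device ID the limit applies to.
  allow(key: string, subject: string, nowMs: number): boolean {
    const limit = DIAGNOSTICS_LIMITS[key];
    if (!limit) throw new Error(`unknown rate limit key: ${key}`);
    const bucketKey = `${key}:${subject}`;
    const current = this.windows.get(bucketKey);
    // Start a fresh window on first use or once the old window has elapsed.
    if (!current || nowMs - current.start >= limit.windowMs) {
      this.windows.set(bucketKey, { start: nowMs, count: 1 });
      return true;
    }
    if (current.count >= limit.max) return false;
    current.count++;
    return true;
  }
}
```

A fixed window permits short bursts at window boundaries; a sliding window or token bucket would smooth that out at the cost of more state.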
Phase 1 Exit Criteria:
- All routes return 200 with correct payloads
- 17 tests passing (diagnostics module) / 839 total platform-service tests
- Event bus subscribers registered and tested
- Audit logs written for all session operations
- Rate limiting enforced
- PII redaction working in log ingest
- Admin can create session via API
- 38+ tests target (deferred: config polling, screenshot tests — Phase 2)
Phase 2: Client SDK Abstractions (Week 1–2)
2.1 TypeScript Client SDK
- 2.1.1 Create @bytelyst/diagnostics package
  - package.json with ESM exports
  - tsconfig.json extending base
- 2.1.2 Core types (src/types.ts)
  - DiagnosticsSession interface
  - TraceSpan interface
  - LogLevel enum (debug, info, warn, error, fatal)
  - DiagnosticsConfig from server
- 2.1.3 Main client (src/client.ts)
  - DiagnosticsClient class (singleton)
  - start(): begin polling for active sessions
  - stop(): cease polling
  - isSessionActive(): check current state
  - trace(name, fn): auto-instrumented span wrapper
  - log(level, message, context): structured logging
  - breadcrumb(message, category): add timeline marker
- 2.1.4 Network interceptor (src/network.ts)
  - NetworkInterceptor class
  - Wrap fetch() for capture
  - Capture: URL, method, headers (sanitized), timing, status
  - Configurable URL patterns (include/exclude)
- 2.1.5 Breadcrumb trail (src/breadcrumbs.ts)
  - Ring buffer (max 100 entries)
  - Auto-capture: navigation, clicks, errors
  - Manual: breadcrumb() API
- 2.1.6 Device state (src/device.ts)
  - Memory, battery, storage, thermal, network type
- 2.1.7 Screenshot capture (src/screenshot.ts)
  - html2canvas integration (browser)
  - Auto-capture on error (configurable)
- 2.1.8 Tests (src/__tests__/diagnostics.test.ts)
  - Client lifecycle tests (4)
  - Trace recording tests (4)
  - Log buffering tests (4)
  - Network interception tests (4)
  - Breadcrumb tests (4)
  - Target: 20+ Vitest tests
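The breadcrumb trail in 2.1.5 calls for a ring buffer capped at 100 entries. One possible shape is sketched below; `BreadcrumbBuffer` and its method names are hypothetical, and the planned src/breadcrumbs.ts may differ.

```typescript
interface Breadcrumb {
  message: string;
  category: string;
  timestamp: string; // ISO 8601
}

// Fixed-capacity ring buffer: once full, each push overwrites the oldest entry.
class BreadcrumbBuffer {
  private readonly capacity: number;
  private entries: Breadcrumb[] = [];
  private next = 0; // index of the slot the next push writes to

  constructor(capacity: number = 100) {
    if (capacity < 1 || Math.floor(capacity) !== capacity) {
      throw new Error("capacity must be a positive integer");
    }
    this.capacity = capacity;
  }

  push(message: string, category: string, now: Date = new Date()): void {
    const crumb: Breadcrumb = { message, category, timestamp: now.toISOString() };
    if (this.entries.length < this.capacity) {
      this.entries.push(crumb);
    } else {
      this.entries[this.next] = crumb;
    }
    this.next = (this.next + 1) % this.capacity;
  }

  // Returns a copy of the entries, oldest first.
  snapshot(): Breadcrumb[] {
    if (this.entries.length < this.capacity) return this.entries.slice();
    return this.entries.slice(this.next).concat(this.entries.slice(0, this.next));
  }
}
```

The buffer never grows past its capacity, so memory stays bounded no matter how long the app runs before a session is triggered.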
2.2 Swift Client SDK (iOS)
- 2.2.1 Create ByteLystDiagnostics Swift package
  - Package.swift with iOS 15+ target
  - Module structure: Core, Network, UI, Device
- 2.2.2 Core client (Sources/Core/DiagnosticsClient.swift)
  - DiagnosticsClient actor (thread-safe)
  - start(): polling with Timer
  - trace<T>(name, operation): async span wrapper
  - log(level, message, metadata): os_log integration
  - breadcrumb(category, message): timeline
- 2.2.3 Network interception (Sources/Network/URLSessionInterceptor.swift)
  - URLProtocol subclass for automatic capture
  - Swizzle URLSession or use protocol registration
  - Capture: request/response, timing, bytes
- 2.2.4 Device state (Sources/Device/DeviceState.swift)
  - UIDevice integration (battery, thermal)
  - ProcessInfo (memory pressure)
  - NetworkMonitor (path status)
- 2.2.5 Screenshot (Sources/UI/ScreenshotCapture.swift)
  - UIApplication key window capture
  - Privacy: blur sensitive views (configurable)
- 2.2.6 Tests (Tests/DiagnosticsClientTests.swift)
  - Unit tests (12)
  - Integration tests (8)
2.3 Kotlin Client SDK (Android)
- 2.3.1 Create diagnostics module in kotlin-platform-sdk
  - build.gradle.kts dependencies
  - Coroutines + OkHttp + WorkManager
- 2.3.2 Core client (diagnostics/DiagnosticsClient.kt)
  - Singleton with StateFlow<DiagnosticsState>
  - start(): foreground service for polling
  - trace(): suspend function with span
  - log(): Logcat + structured queue
- 2.3.3 Network interceptor (diagnostics/OkHttpInterceptor.kt)
  - Interceptor implementation
  - Capture: request/response chain
- 2.3.4 Device state (diagnostics/DeviceStateCollector.kt)
  - BatteryManager, ActivityManager, StorageStatsManager
- 2.3.5 Screenshot (diagnostics/ScreenshotCapture.kt)
  - MediaProjection API (with permission)
  - PixelCopy for surface capture
- 2.3.6 Tests (diagnostics/DiagnosticsClientTest.kt)
  - Unit tests (10)
  - Integration tests (6)
Phase 2 Exit Criteria:
- TS SDK builds + 20 tests passing
- Swift SDK builds + 20 tests passing
- Kotlin SDK builds + 16 tests passing
- All SDKs can poll config endpoint
Phase 3: Admin Dashboard UI (Week 2)
3.1 Debug Sessions Page
- 3.1.1 Create /ops/debug-sessions/page.tsx
  - Session list table (columns: ID, user, device, status, started, duration)
  - Filters: status, product, date range
  - Pagination
  - "New Session" button → modal
- 3.1.2 New Session Modal
  - Target user (email/userId search)
  - Target device (dropdown from sessions)
  - Collection level (standard, debug, trace)
  - Duration slider (5min → 24hr)
  - Screenshot on error (toggle)
  - "Start Session" → POST API
3.2 Session Detail View
- 3.2.1 Create /ops/debug-sessions/[id]/page.tsx
  - Session header (status badge, user info, device info)
  - Action buttons: Extend (+30min), Stop, Download
  - Tabs: Timeline, Logs, Network, Traces, Screenshots
- 3.2.2 Timeline Tab
  - Breadcrumb list (time, category, message)
  - Visual timeline (horizontal bar)
  - Click to jump to related trace/log
- 3.2.3 Logs Tab
  - Log level filters (debug, info, warn, error, fatal)
  - Search/filter
  - Expandable rows with context
  - Syntax highlighting
- 3.2.4 Network Tab
  - Request list (time, method, URL, status, duration)
  - Click to view: request/response headers, body
  - Filter by status code, URL pattern
- 3.2.5 Traces Tab
  - Trace tree view (spans, parent-child)
  - Timing visualization (waterfall)
  - Search by name
- 3.2.6 Screenshots Tab
  - Grid of thumbnails
  - Click to expand
  - Download all as ZIP
3.3 Real-time Updates
- 3.3.1 Server-sent events or polling
  - Auto-refresh session status every 5 seconds
  - Toast notification on new logs/traces
3.4 Client Library
- 3.4.1 Create lib/diagnostics-client.ts
  - querySessions()
  - createSession()
  - getSession()
  - updateSession()
  - getTraces()
  - getLogs()
Phase 3 Exit Criteria:
- Admin can create session from UI
- Session detail shows live data
- All 5 tabs functional (Timeline, Logs, Network, Traces, Screenshots)
Phase 4: Advanced Features (Week 3)
4.1 Automated Triggers
- 4.1.1 Error-threshold triggers
  - Config: "Start debug session if error rate > X%"
  - Background job checks every 5 minutes
  - Auto-notify on Slack/Teams
- 4.1.2 Crash-triggered sessions
  - Client sends crash → server auto-starts session
  - Captures 60 seconds pre-crash context
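The error-threshold rule in 4.1.1 reduces to a single comparison per check. The sketch below is one way the background job could evaluate it; `shouldStartDebugSession` and the `minEvents` guard are assumptions added for illustration, since the roadmap does not yet specify a minimum sample size.

```typescript
interface ErrorWindow {
  totalEvents: number; // events seen in the last check interval
  errorEvents: number; // of those, how many were errors
}

// Returns true when the error rate is strictly greater than thresholdPercent.
// minEvents (an assumed safeguard) avoids triggering on tiny samples.
function shouldStartDebugSession(
  window: ErrorWindow,
  thresholdPercent: number,
  minEvents: number = 50
): boolean {
  if (window.totalEvents < minEvents) return false;
  // Cross-multiply instead of dividing to avoid floating-point rounding
  // at the exact threshold.
  return window.errorEvents * 100 > thresholdPercent * window.totalEvents;
}
```

With a 5% threshold, 12 errors out of 200 events triggers a session, while exactly 10 out of 200 does not, because the comparison is strict.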
4.2 Session Replay (Future)
- 4.2.1 DOM/View state capture
  - Record user interactions (clicks, scrolls, inputs)
  - Replay as video-like timeline
  - Privacy: exclude password fields
4.3 Performance Profiling
- 4.3.1 CPU/Memory profiling
  - iOS: os_signpost integration
  - Android: Debug.MemoryInfo
  - Web: performance.now() + memory API
4.4 Integration Tests
- 4.4.1 E2E test: Admin creates session → Client captures → Admin views
  - Playwright test (web client)
  - XCTest UI test (iOS)
  - Espresso test (Android)
Phase 4 Exit Criteria:
- Auto-trigger tests passing
- E2E flow working end-to-end
Appendix A: Data Models
DebugSessionDoc
```typescript
interface DebugSessionDoc {
  id: string;                  // ds_<uuid>; also the partition key (/id)
  productId: string;           // For filtering/querying (not the pk, to avoid hot partitions)

  // Target (at least one required)
  targetUserId?: string;       // For authenticated users
  targetAnonymousId?: string;  // For anonymous users (installId)
  targetDeviceId?: string;     // Specific device fingerprint
  targetSessionId?: string;    // Specific app session to capture

  // Status lifecycle: pending → active → paused → completed | cancelled
  status: 'pending' | 'active' | 'paused' | 'completed' | 'cancelled';

  // Collection configuration
  collectionLevel: 'standard' | 'debug' | 'trace';
  captureLogs: boolean;
  captureNetwork: boolean;
  captureScreenshots: boolean;
  screenshotOnError: boolean;
  maxDurationMinutes: number;  // Default: 60, Max: 1440 (24h)

  // Timestamps
  createdAt: string;           // ISO 8601
  updatedAt: string;           // Last status/config change
  startedAt?: string;          // When status became 'active'
  endedAt?: string;            // When status became 'completed' or 'cancelled'
  expiresAt: string;           // Auto-expiry (createdAt + maxDurationMinutes)

  // Stats (denormalized for fast reads, updated via ingest)
  logCount: number;
  traceCount: number;
  screenshotCount: number;

  // Audit
  createdBy: string;           // Admin userId who created session
  updatedBy?: string;          // Last admin to modify

  // Consent tracking (privacy compliance)
  userConsent?: {
    consentedAt: string;
    consentMethod: 'prompt' | 'pre_consent' | 'auto'; // How user agreed
  };
}
```
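Two rules are implied by the comments in DebugSessionDoc: the status lifecycle and the expiresAt computation. The sketch below makes both explicit. The transition table is an assumption (the lifecycle comment gives only the overall order, not which jumps are legal), and `canTransition` and `computeExpiresAt` are hypothetical helper names.

```typescript
type SessionStatus = "pending" | "active" | "paused" | "completed" | "cancelled";

// Assumed transition table; the lifecycle comment only gives the overall order.
const ALLOWED_TRANSITIONS: Record<SessionStatus, SessionStatus[]> = {
  pending: ["active", "cancelled"],
  active: ["paused", "completed", "cancelled"],
  paused: ["active", "completed", "cancelled"],
  completed: [],
  cancelled: [],
};

function canTransition(from: SessionStatus, to: SessionStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

const DEFAULT_DURATION_MINUTES = 60;
const MAX_DURATION_MINUTES = 1440; // 24h

// expiresAt = createdAt + maxDurationMinutes, as noted on the expiresAt field.
function computeExpiresAt(
  createdAt: string,
  maxDurationMinutes: number = DEFAULT_DURATION_MINUTES
): string {
  if (maxDurationMinutes < 1 || maxDurationMinutes > MAX_DURATION_MINUTES) {
    throw new Error(`maxDurationMinutes must be between 1 and ${MAX_DURATION_MINUTES}`);
  }
  const createdMs = Date.parse(createdAt);
  if (isNaN(createdMs)) throw new Error("createdAt must be an ISO 8601 timestamp");
  return new Date(createdMs + maxDurationMinutes * 60000).toISOString();
}
```

Treating completed and cancelled as terminal states keeps endedAt stable once it has been written.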
DebugTraceDoc (OpenTelemetry-compatible)
```typescript
interface DebugTraceDoc {
  id: string;          // tr_<uuid>
  pk: string;          // Composite `${productId}:${sessionId}`; partition key
  sessionId: string;   // For queries (also part of pk)
  productId: string;   // For filtering

  // OpenTelemetry trace context
  traceId: string;     // OTel trace ID (hex)
  parentId?: string;   // Parent span ID (omitted for root spans)
  spanId: string;      // This span's ID
  name: string;        // Operation name (e.g., "UserLogin", "API.fetchUser")
  kind?: 'internal' | 'server' | 'client' | 'producer' | 'consumer';

  // Timing (OTel uses nanosecond timestamps; stored here as ISO 8601 strings)
  startTime: string;   // ISO 8601
  endTime?: string;
  durationMs?: number;

  // Context and attributes
  attributes: Record<string, unknown>; // Custom key-value pairs
  status: 'ok' | 'error' | 'unset';
  statusMessage?: string; // Error description if status=error

  // Events (timestamped points within the span, e.g., "db.query", "cache.hit")
  events?: Array<{
    name: string;
    timestamp: string;
    attributes?: Record<string, unknown>;
  }>;

  // Links to other traces (for async operations)
  links?: Array<{
    traceId: string;
    spanId: string;
    attributes?: Record<string, unknown>;
  }>;
}
```
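To show how the timing and status fields of DebugTraceDoc could be filled in, here is a minimal synchronous span wrapper in the spirit of the SDK's planned trace(name, fn). `traceSync`, `SpanRecord`, the injected clock, and the counter-based span IDs are all hypothetical simplifications; a real client would generate OTel-style hex IDs and would also handle async functions.

```typescript
interface SpanRecord {
  spanId: string;
  parentId?: string;
  name: string;
  startTime: string; // ISO 8601
  endTime: string;
  durationMs: number;
  status: "ok" | "error";
  statusMessage?: string;
}

let spanCounter = 0; // placeholder ID source; real span IDs would be random hex

// Times `fn`, records the outcome in `sink`, and rethrows any error so the
// caller still sees the original failure.
function traceSync<T>(
  name: string,
  fn: () => T,
  sink: SpanRecord[],
  now: () => number = Date.now,
  parentId?: string
): T {
  const spanId = `span_${++spanCounter}`;
  const startMs = now();
  let status: "ok" | "error" = "ok";
  let statusMessage: string | undefined;
  try {
    return fn();
  } catch (err) {
    status = "error";
    statusMessage = err instanceof Error ? err.message : String(err);
    throw err;
  } finally {
    const endMs = now();
    const record: SpanRecord = {
      spanId,
      name,
      startTime: new Date(startMs).toISOString(),
      endTime: new Date(endMs).toISOString(),
      durationMs: endMs - startMs,
      status,
    };
    if (parentId !== undefined) record.parentId = parentId;
    if (statusMessage !== undefined) record.statusMessage = statusMessage;
    sink.push(record);
  }
}
```

Recording the span in `finally` means failed operations still produce a span, which is usually the one an engineer wants to see.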
DebugLogEntryDoc
```typescript
interface DebugLogEntryDoc {
  id: string;             // log_<uuid>
  pk: string;             // Composite `${productId}:${sessionId}`; partition key
  sessionId: string;      // For queries (also part of pk)
  productId: string;      // For filtering

  // Log level (matches syslog/OTel severity)
  level: 'debug' | 'info' | 'warn' | 'error' | 'fatal';
  message: string;        // Original message (PII redacted server-side)
  messageHash?: string;   // SHA-256 for deduplication

  // Timestamp (client clock, server enriches with receivedAt)
  timestamp: string;      // ISO 8601; when log was generated
  receivedAt?: string;    // Server-side ingestion time

  // Source context
  module: string;         // Component/module name (e.g., "AudioEngine", "SyncManager")
  file?: string;          // Source file path (sanitized)
  line?: number;          // Line number
  function?: string;      // Function/method name

  // Thread/task context
  threadId?: string;      // For multi-threaded apps
  correlationId?: string; // Links related operations

  // Arbitrary context (PII scanned and redacted)
  context: Record<string, unknown>;

  // PII redaction metadata
  redaction?: {
    fieldsRedacted: string[];  // Which fields were scrubbed
    patternsMatched: string[]; // Which PII patterns found (email, ssn, etc.)
  };
}
```
DebugScreenshotDoc (Metadata only — image in Blob)
```typescript
interface DebugScreenshotDoc {
  id: string;             // scr_<uuid>
  sessionId: string;      // Partition key for queries
  productId: string;

  // Storage reference (actual image in Azure Blob)
  blobUrl: string;        // SAS URL to blob (time-limited)
  blobPath: string;       // Path in container: `screenshots/${productId}/${sessionId}/${id}.png`
  containerName: string;  // Azure Blob container (e.g., "diagnostics-screenshots")

  // Screenshot metadata
  capturedAt: string;     // When captured
  trigger: 'manual' | 'error' | 'interval' | 'user_request'; // Why taken

  // Dimensions & format
  width: number;
  height: number;
  format: 'png' | 'jpeg' | 'webp';
  sizeBytes: number;

  // Privacy
  sensitiveViewsBlurred: boolean; // Whether PII areas were blurred
  blurRegions?: Array<{ x: number; y: number; w: number; h: number }>; // If partial blur

  // Optional context
  screenName?: string;            // Current screen/view when captured
  breadcrumbAtCapture?: string;   // Last breadcrumb before screenshot
}
```
Appendix B: API Reference
| Method | Endpoint | Auth | Rate Limit | Description |
|---|---|---|---|---|
| POST | /api/diagnostics/sessions | Admin | 10/hour/admin | Create debug session |
| GET | /api/diagnostics/sessions | Admin | 100/min | List sessions (paginated) |
| GET | /api/diagnostics/sessions/:id | Admin/Owner | 100/min | Get session details |
| PATCH | /api/diagnostics/sessions/:id | Admin | 10/min | Update session status |
| DELETE | /api/diagnostics/sessions/:id | Admin | 10/min | Cancel session (soft delete) |
| GET | /api/diagnostics/config | Any auth | 1/5sec/device | Poll for active session (ETag cached) |
| POST | /api/diagnostics/ingest | Any auth | 100/min/device | Submit traces/logs batch (max 50 items) |
| POST | /api/diagnostics/sessions/:id/traces | Any auth | 100/min/device | Ingest trace spans |
| POST | /api/diagnostics/sessions/:id/logs | Any auth | 100/min/device | Ingest log entries |
| POST | /api/diagnostics/sessions/:id/screenshots | Any auth | 10/min/device | Get SAS URL for screenshot upload |
| GET | /api/diagnostics/sessions/:id/traces | Admin | 100/min | Query traces for session |
| GET | /api/diagnostics/sessions/:id/logs | Admin | 100/min | Query logs with filters |
| GET | /api/diagnostics/sessions/:id/screenshots | Admin | 100/min | List screenshot metadata |
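The config endpoint is described as ETag cached, which suggests that unchanged polls can be answered with 304 Not Modified. The sketch below shows that decision as a pure function. The ETag scheme (session ID plus updatedAt) and the names `configEtag` and `respondToPoll` are assumptions; the route handler itself is not shown in this document.

```typescript
interface ActiveSessionConfig {
  sessionId: string;
  collectionLevel: "standard" | "debug" | "trace";
  updatedAt: string; // ISO 8601
}

interface PollResponse {
  status: 200 | 304;
  etag: string;
  body?: ActiveSessionConfig | null; // omitted on 304
}

// Assumed ETag scheme: session ID plus last-modified timestamp, or a fixed
// tag when no session is active for the caller.
function configEtag(config: ActiveSessionConfig | null): string {
  return config ? `"${config.sessionId}:${config.updatedAt}"` : '"none"';
}

// Decides between a full response and 304 Not Modified, based on the
// If-None-Match header value sent by the client.
function respondToPoll(
  config: ActiveSessionConfig | null,
  ifNoneMatch?: string
): PollResponse {
  const etag = configEtag(config);
  if (ifNoneMatch === etag) return { status: 304, etag };
  return { status: 200, etag, body: config };
}
```

Because clients may poll as often as once every 5 seconds per device, a body-less 304 keeps the common "nothing changed" case cheap.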
Appendix C: Industry Comparison
| Capability | Firebase Crashlytics | Sentry | Datadog RUM | Our Solution |
|---|---|---|---|---|
| Crash Reporting | ✅ | ✅ | ✅ | ✅ (via telemetry) |
| Error Tracking | ✅ | ✅ | ✅ | ✅ (via telemetry) |
| Breadcrumbs | ✅ | ✅ | ✅ | ✅ |
| Custom Traces | ⚠️ Limited | ✅ | ✅ | ✅ |
| Network Tracing | ❌ | ✅ | ✅ | ✅ |
| Console Logs | ⚠️ Error only | ✅ | ✅ | ✅ (all levels) |
| Session Replay | ❌ | ✅ | ✅ | 🟡 Future |
| Remote Trigger | ❌ | ✅ | ❌ | ✅ |
| On-Device Debug | ❌ | ❌ | ❌ | ✅ |
| Screenshots | ⚠️ Crash only | ✅ | ❌ | ✅ |
| Open Source | ❌ | ✅ (SDK) | ❌ | ✅ |
Appendix D: Privacy & Security
D.1 PII Redaction Patterns (server-side)
| Pattern | Regex | Redaction Method | Example |
|---|---|---|---|
| Email | [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} | Replace with [EMAIL] | user@example.com → [EMAIL] |
| SSN (US) | \b\d{3}-?\d{2}-?\d{4}\b | Replace with [SSN] | 123-45-6789 → [SSN] |
| Credit Card | \b(?:\d[ -]*?){13,16}\b | Replace with [CC] | 4111 1111 1111 1111 → [CC] |
| Phone | \b\+?1?\s?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b | Replace with [PHONE] | +1 (555) 123-4567 → [PHONE] |
| IP Address | \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b | Replace with [IP] | 192.168.1.1 → [IP] |
| Password/Token | (?i)(password\|token\|secret\|key)\s*[:=]\s*\S+ | Replace with [CREDENTIAL] | password: secret123 → password: [CREDENTIAL] |
| JWT | eyJ[a-zA-Z0-9_-]*\.eyJ[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]* | Replace with [JWT] | Full JWT → [JWT] |
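A minimal sketch of how the patterns above might be applied in sequence follows. It is not the contents of lib/pii-redaction.ts; `PII_RULES`, `redactPii`, and the rule order are assumptions. Two patterns are adapted for JavaScript: the `(?i)` prefix becomes the `i` flag, because JavaScript regular expressions do not accept a leading `(?i)`, and the phone pattern drops its leading `\b` so that a leading `+` is consumed (with `\b\+`, a `+` preceded by a space or the start of the string would be left in the output).

```typescript
interface PiiRule {
  name: string;
  pattern: RegExp;
  replace: string;
}

// Rule order is an assumption: JWTs and credentials first, so later numeric
// patterns never see their contents.
const PII_RULES: PiiRule[] = [
  { name: "jwt", pattern: /eyJ[a-zA-Z0-9_-]*\.eyJ[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]*/g, replace: "[JWT]" },
  // Keeps the key name and separator, matching the table's example output.
  { name: "credential", pattern: /(password|token|secret|key)(\s*[:=]\s*)\S+/gi, replace: "$1$2[CREDENTIAL]" },
  { name: "email", pattern: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g, replace: "[EMAIL]" },
  { name: "credit_card", pattern: /\b(?:\d[ -]*?){13,16}\b/g, replace: "[CC]" },
  { name: "ssn", pattern: /\b\d{3}-?\d{2}-?\d{4}\b/g, replace: "[SSN]" },
  // Adapted from the table: no leading \b, so "+1 (555) ..." is fully replaced.
  { name: "phone", pattern: /(?:\+?1\s?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b/g, replace: "[PHONE]" },
  { name: "ip", pattern: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/g, replace: "[IP]" },
];

// Applies every rule in order and reports which rules changed the text.
function redactPii(input: string): { text: string; patternsMatched: string[] } {
  let text = input;
  const patternsMatched: string[] = [];
  for (const rule of PII_RULES) {
    const next = text.replace(rule.pattern, rule.replace);
    if (next !== text) patternsMatched.push(rule.name);
    text = next;
  }
  return { text, patternsMatched };
}
```

The returned `patternsMatched` list maps directly onto the `redaction.patternsMatched` field of DebugLogEntryDoc. Note that the credit card pattern as written also matches any bare run of 13 to 16 digits, such as a millisecond timestamp, so some over-redaction should be expected.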
- 1. PII Redaction: Implement all patterns above in lib/pii-redaction.ts (shared with telemetry)
- 2. Redaction Metadata: Store the scrubbed fields in redaction.fieldsRedacted and the matched patterns in redaction.patternsMatched for transparency
- 3. Consent Tracking: userConsent field in session doc with consentedAt and consentMethod
- 4. Data Retention: 7-day default TTL for sessions/traces, 3-day for logs (Cosmos TTL)
- 5. Access Control: Admin-only session creation; users can only view their own sessions via targetUserId check
- 6. Encryption: All data encrypted at rest (Cosmos), in transit (TLS 1.3)
- 7. Audit Trail: All session operations logged to audit_log container (90-day retention)
- 8. User Notification: Email/push notification when a debug session is started on their device
Appendix E: Event Bus Events
| Event Name | Payload | Publishers | Subscribers |
|---|---|---|---|
| diagnostics.session.created | { sessionId, productId, targetUserId, createdBy } | diagnostics module | notifications → email/push user |
| diagnostics.session.started | { sessionId, productId, startedAt } | diagnostics module | None |
| diagnostics.session.updated | { sessionId, productId, changes, updatedBy } | diagnostics module | audit log |
| diagnostics.session.cancelled | { sessionId, productId, reason, cancelledBy } | diagnostics module | notifications → admin |
| diagnostics.session.completed | { sessionId, productId, stats, endedAt } | diagnostics module | notifications → admin summary email |
| diagnostics.session.expired | { sessionId, productId, expiredAt } | diagnostics module (TTL job) | None |
| diagnostics.ingest.fatal | { sessionId, productId, logEntry, timestamp } | diagnostics module (on fatal log) | PagerDuty/Slack alert |
| diagnostics.screenshot.captured | { sessionId, productId, screenshotId, trigger } | diagnostics module | None |
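The event table lends itself to a typed event map, so that publishers and subscribers agree on payload shapes at compile time. The sketch below covers three of the eight events with a tiny in-memory bus. `DiagnosticsEvents` and `InMemoryEventBus` are hypothetical; the real lib/event-bus.ts API is not shown in this document, and the field types are assumed from the payload names.

```typescript
// Payload shapes follow the table above; field types are assumptions.
interface DiagnosticsEvents {
  "diagnostics.session.created": {
    sessionId: string;
    productId: string;
    targetUserId?: string;
    createdBy: string;
  };
  "diagnostics.session.cancelled": {
    sessionId: string;
    productId: string;
    reason: string;
    cancelledBy: string;
  };
  "diagnostics.session.expired": {
    sessionId: string;
    productId: string;
    expiredAt: string;
  };
}

type EventName = keyof DiagnosticsEvents;

class InMemoryEventBus {
  private handlers = new Map<string, Array<(payload: unknown) => void>>();

  subscribe<K extends EventName>(
    event: K,
    handler: (payload: DiagnosticsEvents[K]) => void
  ): void {
    const list = this.handlers.get(event) ?? [];
    // The cast is safe because publish() only passes payloads typed for `event`.
    list.push(handler as (payload: unknown) => void);
    this.handlers.set(event, list);
  }

  // Delivers the payload synchronously and returns how many handlers ran.
  publish<K extends EventName>(event: K, payload: DiagnosticsEvents[K]): number {
    const list = this.handlers.get(event) ?? [];
    for (const handler of list) handler(payload);
    return list.length;
  }
}
```

With this shape, publishing `diagnostics.session.created` without a `createdBy` field fails type checking instead of surfacing later in a notification template.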
Current Status
- Design complete — 2026-03-02
- Review complete — 10 bugs/gaps identified and fixed
- Phase 1: Server Foundation: COMPLETE (2026-03-03)
  - 17 diagnostics tests passing, 839 total platform-service tests
  - Event bus subscribers registered for 8 events
  - Audit logging for all session lifecycle events
  - Rate limiting keys configured
  - 4 email templates ready for notifications
- Phase 2: Client SDKs — Next (TS/Swift/Kotlin)
- Phase 3: Admin UI
- Phase 4: Advanced Features
Total Tasks: 140+ checkboxes across 4 phases
Last Updated: 2026-03-03