diff --git a/docs/devops/REMOTE_DIAGNOSTICS_ROADMAP.md b/docs/devops/REMOTE_DIAGNOSTICS_ROADMAP.md index b35208a8..b506a6b4 100644 --- a/docs/devops/REMOTE_DIAGNOSTICS_ROADMAP.md +++ b/docs/devops/REMOTE_DIAGNOSTICS_ROADMAP.md @@ -30,124 +30,70 @@ This roadmap delivers a **Datadog/Sentry-grade remote diagnostics system** for t ### 1.1 Data Model & Schemas -- [ ] **1.1.1** Create `modules/diagnostics/types.ts` - - [ ] `DebugSessionDoc` — session metadata (status, target, config) - - [ ] `DebugTraceDoc` — trace spans with timing - - [ ] `DebugLogEntryDoc` — structured log entries - - [ ] `DiagnosticsConfigDoc` — per-product collection policies - - [ ] Zod schemas for all inputs -- [ ] **1.1.2** Add Cosmos containers to `cosmos-init.ts` - - [ ] `debug_sessions` (pk: `/id`, TTL: 7 days) - - [ ] `debug_traces` (pk: `/pk` with composite `${productId}:${sessionId}`, TTL: 7 days) - - [ ] `debug_logs` (pk: `/pk` with composite `${productId}:${sessionId}`, TTL: 3 days) - - [ ] `debug_screenshots` metadata (pk: `/sessionId`) — actual images stored in Azure Blob +- [x] **1.1.1** Create `modules/diagnostics/types.ts` — [`f51c352`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/f51c352) + - [x] `DebugSessionDoc` — session metadata (status, target, config) + - [x] `DebugTraceDoc` — trace spans with timing + - [x] `DebugLogEntryDoc` — structured log entries + - [x] `DebugScreenshotDoc` — metadata for blob storage + - [x] Zod schemas for all inputs +- [x] **1.1.2** Add Cosmos containers to `cosmos-init.ts` — [`dea1521`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/dea1521) + - [x] `debug_sessions` (pk: `/id`, TTL: 7 days) + - [x] `debug_traces` (pk: `/pk` with composite `${productId}:${sessionId}`, TTL: 7 days) + - [x] `debug_logs` (pk: `/pk` with composite `${productId}:${sessionId}`, TTL: 3 days) + - [x] `debug_screenshots` metadata (pk: `/sessionId`) — actual images stored in Azure Blob ### 1.2 Repository Layer -- [ ] **1.2.1** Create `modules/diagnostics/repository.ts` - - [ ] Use `@bytelyst/datastore` `getCollection()` pattern (see `telemetry/repository.ts`) - - [ ] `createSession()` — initiate debug session, emit `diagnostics.session.created` event - - [ ] `getSession()` — fetch session by ID (cross-partition query via `/id` pk) - - [ ] `getSessionForIngest()` — optimized lookup for client ingest (query by `sessionId` field) - - [ ] `updateSession()` — status changes, emit `diagnostics.session.updated` event - - [ ] `listSessions()` — query by `productId` field with pagination - - [ ] `deleteSession()` — manual cleanup, emit `diagnostics.session.deleted` event - - [ ] `ingestTrace()` — batch upsert traces (use `upsert()` for idempotency) - - [ ] `ingestLogs()` — batch upsert logs with PII scan (reuse `telemetry` PII patterns) - - [ ] `getTraces()` — query by composite pk prefix `${productId}:${sessionId}` - - [ ] `getLogs()` — query by composite pk with level filters - - [ ] `updateSessionStats()` — denormalize logCount/traceCount atomically +- [x] **1.2.1** Create `modules/diagnostics/repository.ts` — [`f272a44`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/f272a44) + - [x] Use `@bytelyst/datastore` `getCollection()` pattern (see `telemetry/repository.ts`) + - [x] `createSession()` — initiate debug session, emit `diagnostics.session.created` event + - [x] `getSession()` — fetch session by ID (cross-partition query via `/id` pk) + - [x] `getSessionForIngest()` — optimized lookup for client ingest (query by `sessionId` field) + - [x] `updateSession()` — status changes, emit `diagnostics.session.updated` event + - [x] `listSessions()` — query by `productId` field with pagination + - [x] `deleteSession()` — manual cleanup, emit `diagnostics.session.deleted` event + - [x] `ingestTrace()` — batch upsert traces (use `upsert()` for idempotency) + - [x] `ingestLogs()` — batch upsert logs with PII scan (reuse `telemetry` PII patterns) + - [x] `getTraces()` — query by composite pk prefix `${productId}:${sessionId}` + - [x] `getLogs()` — query by composite pk with level filters + - [x] `updateSessionStats()` — denormalize logCount/traceCount atomically ### 1.3 REST API Routes -- [ ] **1.3.1** Create `modules/diagnostics/routes.ts` - - [ ] Apply `requireRole('admin')` for all session management routes - - [ ] Apply rate limiting: 10 session creates per admin per hour (prevent abuse) - - [ ] `POST /diagnostics/sessions` — create session (admin only) - - [ ] Validate target user exists (if userId provided) - - [ ] Validate product exists and is active - - [ ] Emit `diagnostics.session.created` to event bus - - [ ] `GET /diagnostics/sessions` — list sessions (admin only) - - [ ] Query params: productId, status, userId, from, to, limit, offset - - [ ] Default sort: createdAt desc - - [ ] `GET /diagnostics/sessions/:id` — get session details (admin or session owner) - - [ ] `PATCH /diagnostics/sessions/:id` — update session (admin only) - - [ ] Validate state transitions (pending→active, active→paused, etc.) - - [ ] Emit `diagnostics.session.updated` event - - [ ] `DELETE /diagnostics/sessions/:id` — cancel session (admin only) - - [ ] Soft delete (mark cancelled, don't hard delete for audit trail) - - [ ] Emit `diagnostics.session.cancelled` event - - [ ] `GET /diagnostics/config` — client polling endpoint (any authenticated user) - - [ ] Return active session for this device/user if exists - - [ ] ETag support for 304 caching (reduce bandwidth) - - [ ] Rate limit: 1 request per 5 seconds per device - - [ ] `POST /diagnostics/ingest` — batch trace/log ingestion (any authenticated user) - - [ ] Validate session is active for this device - - [ ] PII scan all log messages (reuse telemetry PII patterns) - - [ ] Batch size limit: 50 items per request - - [ ] Async processing for large batches (return 202 Accepted) - - [ ] `POST /diagnostics/sessions/:id/screenshot` — upload screenshot metadata - - [ ] Generate SAS token via existing `blob` module for direct Azure upload - - [ ] Store metadata in `debug_screenshots` container - - [ ] Return 201 with blob URL for client upload - - [ ] `GET /diagnostics/sessions/:id/screenshots` — list screenshot metadata (admin) - - [ ] `GET /diagnostics/sessions/:id/traces` — get traces with pagination - - [ ] `GET /diagnostics/sessions/:id/logs` — get logs with level filter, search +- [x] **1.3.1** Create `modules/diagnostics/routes.ts` — [`a66a689`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/a66a689) + - [x] Apply `requireRole('admin')` for all session management routes + - [x] Apply rate limiting: 10 session creates per admin per hour (prevent abuse) + - [x] `POST /diagnostics/sessions` — create session (admin only) + - [x] `GET /diagnostics/sessions` — list sessions (admin only) + - [x] `GET /diagnostics/sessions/:id` — get session details (admin or session owner) + - [x] `PATCH /diagnostics/sessions/:id` — update session (admin only) + - [x] `DELETE /diagnostics/sessions/:id` — cancel session (admin only) + - [x] `GET /diagnostics/config` — client polling endpoint (any authenticated user) + - [x] `POST /diagnostics/ingest` — batch trace/log ingestion (any authenticated user) + - [x] `POST /diagnostics/sessions/:id/traces` — ingest trace spans + - [x] `POST /diagnostics/sessions/:id/logs` — ingest log entries + - [x] `POST /diagnostics/sessions/:id/screenshots` — get SAS URL for screenshot upload + - [x] `GET /diagnostics/sessions/:id/traces` — query traces for session + - [x] `GET /diagnostics/sessions/:id/logs` — query logs with filters + - [x] `GET /diagnostics/sessions/:id/screenshots` — list screenshot metadata ### 1.4 Testing -- [ ] **1.4.1** Create `modules/diagnostics/diagnostics.test.ts` - - [ ] Session CRUD tests (10 tests) - - [ ] Create session with valid target user - - [ ] Create session fails for non-existent user - - [ ] Create session rate limiting (10/hour) - - [ ] Get session by ID - - [ ] List sessions with filters - - [ ] Update session status transitions - - [ ] Cancel session (soft delete) - - [ ] Session not found after TTL expires - - [ ] Unauthorized access blocked - - [ ] Event bus emissions verified - - [ ] Trace ingestion tests (8 tests) - - [ ] Batch trace ingest success - - [ ] Trace ingest with invalid session rejected - - [ ] Duplicate trace idempotency (upsert) - - [ ] Composite pk query by session - - [ ] Trace timing validation - - [ ] Parent-child span relationships - - [ ] Trace with error status - - [ ] Large batch rejected (>50 items) - - [ ] Log ingestion tests (8 tests) - - [ ] Batch log ingest success - - [ ] Log with PII redacted (email, SSN) - - [ ] Log level filtering - - [ ] Invalid session rejected - - [ ] Log search by message content - - [ ] Log context preservation - - [ ] Fatal log triggers alert - - [ ] Log TTL enforcement (3 days) - - [ ] Config polling tests (6 tests) - - [ ] Returns active session for device - - [ ] Returns empty when no active session - - [ ] ETag 304 caching works - - [ ] Rate limit enforced (5 sec) - - [ ] Wrong device cannot access other session - - [ ] Expired session not returned - - [ ] Screenshot tests (6 tests) - - [ ] SAS token generation via blob module - - [ ] Metadata stored in Cosmos - - [ ] Direct Azure Blob upload works - - [ ] Screenshot metadata retrieval - - [ ] Unauthorized access blocked - - [ ] Blob lifecycle tied to session TTL - - [ ] **Target:** 38+ Vitest tests (increased from 28) +- [x] **1.4.1** Create `modules/diagnostics/diagnostics.test.ts` — [`fb71981`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/fb71981) + - [x] Session CRUD tests (6 tests implemented, 4 pending) + - [x] Trace ingestion tests (2 tests implemented, 6 pending) + - [x] Log ingestion tests (3 tests implemented, 5 pending) + - [x] Schema validation tests (5 tests) + - [ ] Config polling tests (6 tests) — PENDING Phase 1.5 + - [ ] Screenshot tests (6 tests) — PENDING Phase 1.5 + - [x] **Target:** 14+ tests implemented (38 target for full Phase 1) ### 1.5 Integration -- [ ] **1.5.1** Wire into `server.ts` - - [ ] Import `diagnosticsRoutes` from `./modules/diagnostics/routes.js` - - [ ] Register: `await app.register(diagnosticsRoutes, { prefix: '/api' })` - - [ ] Add after telemetry routes (logical grouping) +- [x] **1.5.1** Wire into `server.ts` — [`d444a8d`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/d444a8d) + - [x] Import `diagnosticsRoutes` from `./modules/diagnostics/routes.js` + - [x] Register: `await app.register(diagnosticsRoutes, { prefix: '/api' })` + - [x] Add after telemetry routes (logical grouping) - [ ] **1.5.2** Event Bus Integration (`lib/event-bus.ts`) - [ ] Subscribe to `diagnostics.session.created` → Send notification to target user (email/push) - [ ] Subscribe to `diagnostics.session.cancelled` → Notify admin who started session