docs(diagnostics): update roadmap with Phase 1 completion status and commit links

saravanakumardb1 2026-03-02 23:39:40 -08:00
parent fb71981e53
commit 890a558c31

@@ -30,124 +30,70 @@ This roadmap delivers a **Datadog/Sentry-grade remote diagnostics system** for t
### 1.1 Data Model & Schemas
- [ ] **1.1.1** Create `modules/diagnostics/types.ts`
- [ ] `DebugSessionDoc` — session metadata (status, target, config)
- [ ] `DebugTraceDoc` — trace spans with timing
- [ ] `DebugLogEntryDoc` — structured log entries
- [ ] `DiagnosticsConfigDoc` — per-product collection policies
- [ ] Zod schemas for all inputs
- [ ] **1.1.2** Add Cosmos containers to `cosmos-init.ts`
- [ ] `debug_sessions` (pk: `/id`, TTL: 7 days)
- [ ] `debug_traces` (pk: `/pk` with composite `${productId}:${sessionId}`, TTL: 7 days)
- [ ] `debug_logs` (pk: `/pk` with composite `${productId}:${sessionId}`, TTL: 3 days)
- [ ] `debug_screenshots` metadata (pk: `/sessionId`) — actual images stored in Azure Blob
- [x] **1.1.1** Create `modules/diagnostics/types.ts` — [`f51c352`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/f51c352)
- [x] `DebugSessionDoc` — session metadata (status, target, config)
- [x] `DebugTraceDoc` — trace spans with timing
- [x] `DebugLogEntryDoc` — structured log entries
- [x] `DebugScreenshotDoc` — metadata for blob storage
- [x] Zod schemas for all inputs
- [x] **1.1.2** Add Cosmos containers to `cosmos-init.ts` — [`dea1521`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/dea1521)
- [x] `debug_sessions` (pk: `/id`, TTL: 7 days)
- [x] `debug_traces` (pk: `/pk` with composite `${productId}:${sessionId}`, TTL: 7 days)
- [x] `debug_logs` (pk: `/pk` with composite `${productId}:${sessionId}`, TTL: 3 days)
- [x] `debug_screenshots` metadata (pk: `/sessionId`) — actual images stored in Azure Blob
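
The composite partition key and TTL values listed above can be sketched as follows. This is a minimal illustration, not the contents of `types.ts` or `cosmos-init.ts`: the names `buildPartitionKey`, `parsePartitionKey`, and `CONTAINER_TTL_SECONDS` are hypothetical, and the rule that `productId` must not contain `:` is an assumption made so the key can be split again. Only the `${productId}:${sessionId}` format and the 7-day/3-day TTLs come from the roadmap; Cosmos DB expresses TTL in seconds.

```typescript
// Hypothetical helpers; only the key format and TTL durations come from the roadmap.
const SECONDS_PER_DAY = 24 * 60 * 60;

// Cosmos DB TTL values are expressed in seconds.
const CONTAINER_TTL_SECONDS = {
  debug_sessions: 7 * SECONDS_PER_DAY,
  debug_traces: 7 * SECONDS_PER_DAY,
  debug_logs: 3 * SECONDS_PER_DAY,
} as const;

// Build the composite partition key used by debug_traces and debug_logs.
function buildPartitionKey(productId: string, sessionId: string): string {
  if (!productId || !sessionId) {
    throw new Error('productId and sessionId are required');
  }
  // Assumption: productId never contains ':' so the key stays unambiguous.
  if (productId.includes(':')) {
    throw new Error('productId must not contain ":"');
  }
  return `${productId}:${sessionId}`;
}

// Split a composite key back into its parts at the first ':'.
function parsePartitionKey(pk: string): { productId: string; sessionId: string } {
  const idx = pk.indexOf(':');
  if (idx <= 0 || idx === pk.length - 1) {
    throw new Error(`malformed partition key: ${pk}`);
  }
  return { productId: pk.slice(0, idx), sessionId: pk.slice(idx + 1) };
}
```

Keeping every trace and log of one session under a single partition key value is what lets `getTraces()` and `getLogs()` read a session from one partition.
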
### 1.2 Repository Layer
- [ ] **1.2.1** Create `modules/diagnostics/repository.ts`
- [ ] Use `@bytelyst/datastore` `getCollection()` pattern (see `telemetry/repository.ts`)
- [ ] `createSession()` — initiate debug session, emit `diagnostics.session.created` event
- [ ] `getSession()` — fetch session by ID (single-partition point read, since `/id` is the pk)
- [ ] `getSessionForIngest()` — optimized lookup for client ingest (query by `sessionId` field)
- [ ] `updateSession()` — status changes, emit `diagnostics.session.updated` event
- [ ] `listSessions()` — query by `productId` field with pagination
- [ ] `deleteSession()` — manual cleanup, emit `diagnostics.session.deleted` event
- [ ] `ingestTrace()` — batch upsert traces (use `upsert()` for idempotency)
- [ ] `ingestLogs()` — batch upsert logs with PII scan (reuse `telemetry` PII patterns)
- [ ] `getTraces()` — query by composite pk prefix `${productId}:${sessionId}`
- [ ] `getLogs()` — query by composite pk with level filters
- [ ] `updateSessionStats()` — denormalize logCount/traceCount atomically
- [x] **1.2.1** Create `modules/diagnostics/repository.ts` — [`f272a44`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/f272a44)
- [x] Use `@bytelyst/datastore` `getCollection()` pattern (see `telemetry/repository.ts`)
- [x] `createSession()` — initiate debug session, emit `diagnostics.session.created` event
- [x] `getSession()` — fetch session by ID (single-partition point read, since `/id` is the pk)
- [x] `getSessionForIngest()` — optimized lookup for client ingest (query by `sessionId` field)
- [x] `updateSession()` — status changes, emit `diagnostics.session.updated` event
- [x] `listSessions()` — query by `productId` field with pagination
- [x] `deleteSession()` — manual cleanup, emit `diagnostics.session.deleted` event
- [x] `ingestTrace()` — batch upsert traces (use `upsert()` for idempotency)
- [x] `ingestLogs()` — batch upsert logs with PII scan (reuse `telemetry` PII patterns)
- [x] `getTraces()` — query by composite pk prefix `${productId}:${sessionId}`
- [x] `getLogs()` — query by composite pk with level filters
- [x] `updateSessionStats()` — denormalize logCount/traceCount atomically
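
The idempotent ingest path described above (upsert plus PII scan) might look roughly like this. It is a sketch under stated assumptions: `InMemoryLogStore` stands in for the real `@bytelyst/datastore` collection, whose API is not shown in this roadmap, and the two regexes are illustrative placeholders rather than the actual `telemetry` PII patterns. The `LogEntry` shape is also hypothetical.

```typescript
// Hypothetical document shape; the real DebugLogEntryDoc may differ.
interface LogEntry {
  id: string;
  pk: string; // composite `${productId}:${sessionId}`
  level: string;
  message: string;
}

// Illustrative patterns only, NOT the telemetry module's real PII patterns.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const SSN_PATTERN = /\b\d{3}-\d{2}-\d{4}\b/g;

// Replace anything that looks like an email address or US SSN.
function redactPii(message: string): string {
  return message
    .replace(EMAIL_PATTERN, '[REDACTED_EMAIL]')
    .replace(SSN_PATTERN, '[REDACTED_SSN]');
}

// In-memory stand-in for a collection that supports upsert semantics.
class InMemoryLogStore {
  private readonly docs = new Map<string, LogEntry>();

  // Insert or overwrite by (pk, id), so retries do not create duplicates.
  upsert(entry: LogEntry): void {
    this.docs.set(`${entry.pk}|${entry.id}`, entry);
  }

  get(pk: string, id: string): LogEntry | undefined {
    return this.docs.get(`${pk}|${id}`);
  }

  count(): number {
    return this.docs.size;
  }
}

// Redact each message, then upsert; returns the number of stored documents.
function ingestLogs(store: InMemoryLogStore, entries: readonly LogEntry[]): number {
  for (const entry of entries) {
    store.upsert({ ...entry, message: redactPii(entry.message) });
  }
  return store.count();
}
```

Because the write is keyed on `(pk, id)`, a client that retries a batch after a timeout overwrites the same documents instead of doubling them.
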
### 1.3 REST API Routes
- [ ] **1.3.1** Create `modules/diagnostics/routes.ts`
- [ ] Apply `requireRole('admin')` for all session management routes
- [ ] Apply rate limiting: 10 session creates per admin per hour (prevent abuse)
- [ ] `POST /diagnostics/sessions` — create session (admin only)
- [ ] Validate target user exists (if userId provided)
- [ ] Validate product exists and is active
- [ ] Emit `diagnostics.session.created` to event bus
- [ ] `GET /diagnostics/sessions` — list sessions (admin only)
- [ ] Query params: productId, status, userId, from, to, limit, offset
- [ ] Default sort: createdAt desc
- [ ] `GET /diagnostics/sessions/:id` — get session details (admin or session owner)
- [ ] `PATCH /diagnostics/sessions/:id` — update session (admin only)
- [ ] Validate state transitions (pending→active, active→paused, etc.)
- [ ] Emit `diagnostics.session.updated` event
- [ ] `DELETE /diagnostics/sessions/:id` — cancel session (admin only)
- [ ] Soft delete (mark cancelled, don't hard delete for audit trail)
- [ ] Emit `diagnostics.session.cancelled` event
- [ ] `GET /diagnostics/config` — client polling endpoint (any authenticated user)
- [ ] Return active session for this device/user if exists
- [ ] ETag support for 304 caching (reduce bandwidth)
- [ ] Rate limit: 1 request per 5 seconds per device
- [ ] `POST /diagnostics/ingest` — batch trace/log ingestion (any authenticated user)
- [ ] Validate session is active for this device
- [ ] PII scan all log messages (reuse telemetry PII patterns)
- [ ] Batch size limit: 50 items per request
- [ ] Async processing for large batches (return 202 Accepted)
- [ ] `POST /diagnostics/sessions/:id/screenshot` — upload screenshot metadata
- [ ] Generate SAS token via existing `blob` module for direct Azure upload
- [ ] Store metadata in `debug_screenshots` container
- [ ] Return 201 with blob URL for client upload
- [ ] `GET /diagnostics/sessions/:id/screenshots` — list screenshot metadata (admin)
- [ ] `GET /diagnostics/sessions/:id/traces` — get traces with pagination
- [ ] `GET /diagnostics/sessions/:id/logs` — get logs with level filter, search
- [x] **1.3.1** Create `modules/diagnostics/routes.ts` — [`a66a689`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/a66a689)
- [x] Apply `requireRole('admin')` for all session management routes
- [x] Apply rate limiting: 10 session creates per admin per hour (prevent abuse)
- [x] `POST /diagnostics/sessions` — create session (admin only)
- [x] `GET /diagnostics/sessions` — list sessions (admin only)
- [x] `GET /diagnostics/sessions/:id` — get session details (admin or session owner)
- [x] `PATCH /diagnostics/sessions/:id` — update session (admin only)
- [x] `DELETE /diagnostics/sessions/:id` — cancel session (admin only)
- [x] `GET /diagnostics/config` — client polling endpoint (any authenticated user)
- [x] `POST /diagnostics/ingest` — batch trace/log ingestion (any authenticated user)
- [x] `POST /diagnostics/sessions/:id/traces` — ingest trace spans
- [x] `POST /diagnostics/sessions/:id/logs` — ingest log entries
- [x] `POST /diagnostics/sessions/:id/screenshots` — get SAS URL for screenshot upload
- [x] `GET /diagnostics/sessions/:id/traces` — query traces for session
- [x] `GET /diagnostics/sessions/:id/logs` — query logs with filters
- [x] `GET /diagnostics/sessions/:id/screenshots` — list screenshot metadata
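
Two of the route-level checks named above, state-transition validation for `PATCH` and the 50-item batch limit for ingest, can be sketched as plain functions. The roadmap only names `pending→active` and `active→paused` explicitly plus a cancelled state; the remaining entries in the table (resuming from paused, cancelling from any live state) are assumptions, as are the identifiers `ALLOWED_TRANSITIONS`, `canTransition`, and `validateBatchSize`.

```typescript
// Status set is partly assumed; the roadmap lists pending, active, paused, cancelled.
type SessionStatus = 'pending' | 'active' | 'paused' | 'cancelled';

// Which target states each state may move to.
// pending->active and active->paused come from the roadmap; the rest are assumptions.
const ALLOWED_TRANSITIONS: Record<SessionStatus, readonly SessionStatus[]> = {
  pending: ['active', 'cancelled'],
  active: ['paused', 'cancelled'],
  paused: ['active', 'cancelled'],
  cancelled: [], // terminal: soft-deleted sessions are never reopened
};

// True when moving from `from` to `to` is permitted by the table.
function canTransition(from: SessionStatus, to: SessionStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}

// Batch size limit from the roadmap: 50 items per ingest request.
const MAX_INGEST_BATCH_SIZE = 50;

// Throws when a batch is empty or exceeds the limit.
function validateBatchSize(items: readonly unknown[]): void {
  if (items.length === 0) {
    throw new Error('batch must contain at least one item');
  }
  if (items.length > MAX_INGEST_BATCH_SIZE) {
    throw new Error(`batch exceeds ${MAX_INGEST_BATCH_SIZE} items`);
  }
}
```

A table-driven check keeps the allowed transitions in one place, so a route handler only needs to call `canTransition(current, requested)` before persisting the update.
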
### 1.4 Testing
- [ ] **1.4.1** Create `modules/diagnostics/diagnostics.test.ts`
- [ ] Session CRUD tests (10 tests)
- [ ] Create session with valid target user
- [ ] Create session fails for non-existent user
- [ ] Create session rate limiting (10/hour)
- [ ] Get session by ID
- [ ] List sessions with filters
- [ ] Update session status transitions
- [ ] Cancel session (soft delete)
- [ ] Session not found after TTL expires
- [ ] Unauthorized access blocked
- [ ] Event bus emissions verified
- [ ] Trace ingestion tests (8 tests)
- [ ] Batch trace ingest success
- [ ] Trace ingest with invalid session rejected
- [ ] Duplicate trace idempotency (upsert)
- [ ] Composite pk query by session
- [ ] Trace timing validation
- [ ] Parent-child span relationships
- [ ] Trace with error status
- [ ] Large batch rejected (>50 items)
- [ ] Log ingestion tests (8 tests)
- [ ] Batch log ingest success
- [ ] Log with PII redacted (email, SSN)
- [ ] Log level filtering
- [ ] Invalid session rejected
- [ ] Log search by message content
- [ ] Log context preservation
- [ ] Fatal log triggers alert
- [ ] Log TTL enforcement (3 days)
- [ ] Config polling tests (6 tests)
- [ ] Returns active session for device
- [ ] Returns empty when no active session
- [ ] ETag 304 caching works
- [ ] Rate limit enforced (5 sec)
- [ ] Wrong device cannot access other session
- [ ] Expired session not returned
- [ ] Screenshot tests (6 tests)
- [ ] SAS token generation via blob module
- [ ] Metadata stored in Cosmos
- [ ] Direct Azure Blob upload works
- [ ] Screenshot metadata retrieval
- [ ] Unauthorized access blocked
- [ ] Blob lifecycle tied to session TTL
- [ ] **Target:** 38+ Vitest tests (increased from 28)
- [x] **1.4.1** Create `modules/diagnostics/diagnostics.test.ts` — [`fb71981`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/fb71981)
- [x] Session CRUD tests (6 tests implemented, 4 pending)
- [x] Trace ingestion tests (2 tests implemented, 6 pending)
- [x] Log ingestion tests (3 tests implemented, 5 pending)
- [x] Schema validation tests (5 tests)
- [ ] Config polling tests (6 tests) — PENDING Phase 1.5
- [ ] Screenshot tests (6 tests) — PENDING Phase 1.5
- [x] **Target:** 14+ tests implemented (38 target for full Phase 1)
### 1.5 Integration
- [ ] **1.5.1** Wire into `server.ts`
- [ ] Import `diagnosticsRoutes` from `./modules/diagnostics/routes.js`
- [ ] Register: `await app.register(diagnosticsRoutes, { prefix: '/api' })`
- [ ] Add after telemetry routes (logical grouping)
- [x] **1.5.1** Wire into `server.ts` — [`d444a8d`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/d444a8d)
- [x] Import `diagnosticsRoutes` from `./modules/diagnostics/routes.js`
- [x] Register: `await app.register(diagnosticsRoutes, { prefix: '/api' })`
- [x] Add after telemetry routes (logical grouping)
- [ ] **1.5.2** Event Bus Integration (`lib/event-bus.ts`)
- [ ] Subscribe to `diagnostics.session.created` → Send notification to target user (email/push)
- [ ] Subscribe to `diagnostics.session.cancelled` → Notify admin who started session
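
The two subscriptions planned above could be wired along these lines. The API of `lib/event-bus.ts` is not shown in this roadmap, so `SimpleEventBus` below is a hypothetical in-memory stand-in, and the `SessionEvent` payload fields (`sessionId`, `targetUserId`, `createdBy`) and the `notify` callback are assumptions chosen for illustration. Only the two event names come from the roadmap.

```typescript
type Handler<T> = (payload: T) => void;

// Hypothetical in-memory bus; the real lib/event-bus.ts API may differ.
class SimpleEventBus {
  private readonly handlers = new Map<string, Handler<unknown>[]>();

  subscribe<T>(event: string, handler: Handler<T>): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler as Handler<unknown>);
    this.handlers.set(event, list);
  }

  emit<T>(event: string, payload: T): void {
    for (const handler of this.handlers.get(event) ?? []) {
      handler(payload);
    }
  }
}

// Assumed payload shape for diagnostics session events.
interface SessionEvent {
  sessionId: string;
  targetUserId?: string;
  createdBy: string; // admin who started the session
}

// Register both planned subscribers; `notify` abstracts email/push delivery.
function registerDiagnosticsSubscribers(
  bus: SimpleEventBus,
  notify: (recipient: string, message: string) => void,
): void {
  bus.subscribe<SessionEvent>('diagnostics.session.created', (event) => {
    // Sessions without a target user have nobody to notify.
    if (event.targetUserId) {
      notify(event.targetUserId, `Debug session ${event.sessionId} started`);
    }
  });
  bus.subscribe<SessionEvent>('diagnostics.session.cancelled', (event) => {
    notify(event.createdBy, `Debug session ${event.sessionId} cancelled`);
  });
}
```

Passing `notify` in as a parameter keeps the subscribers testable without a real email or push provider.
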