# AI Diagnostic Assistant - Implementation Roadmap

> **Module:** `platform-service/src/modules/ai-diagnostics/`
> **Admin UI:** `/ops/ai-diagnostics/`
> **Target:** LLM-powered root cause analysis from telemetry + debug sessions
> **Estimated Effort:** 2–3 weeks
> **Status:** 🟡 Planning

---

## Executive Summary

This roadmap delivers an **AI-powered diagnostic assistant** that analyzes error patterns, debug session data, and telemetry to suggest root causes automatically, like having a senior engineer on call 24/7. Engineers can ask natural language questions like _"Why did the iOS keyboard crash yesterday?"_ and receive AI-generated hypotheses with supporting evidence.

### Key Differentiators vs. Manual Debugging

| Feature           | Manual Debugging            | AI Diagnostic Assistant             |
| ----------------- | --------------------------- | ----------------------------------- |
| Query             | SQL + log grep              | **Natural language**                |
| Pattern Detection | Hours of manual correlation | **AI finds hidden patterns**        |
| Context Assembly  | Check 5+ systems manually   | **Auto-assembles timeline**         |
| Hypothesis        | Engineer intuition          | **LLM-generated + evidence**        |
| Learning          | Per-engineer experience     | **Accumulates across all sessions** |

---

## Phase 1: Data Pipeline & Embeddings (Week 1)

**Goal:** Extract, normalize, and embed error data for semantic search and clustering.
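The normalize-then-fingerprint step that this phase depends on can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the module's actual implementation: the function names (`templateMessage`, `stackSignature`, `fingerprint`), the placeholder tokens, and the rule that user IDs are bare numbers with four or more digits are all hypothetical, and the stack parsing assumes V8-style frames of the form `at fn (path:line:col)`.

```typescript
import { createHash } from "node:crypto";

// Patterns for volatile values. The placeholder spellings below are assumptions.
const UUID_RE = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi;
const ISO_TIMESTAMP_RE =
  /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?/g;
// Assumption: user IDs appear as bare numbers with 4+ digits.
const LONG_NUMBER_RE = /\b\d{4,}\b/g;

/** Replace volatile values in an error message with stable placeholders. */
export function templateMessage(message: string): string {
  return message
    .replace(UUID_RE, "<uuid>")
    .replace(ISO_TIMESTAMP_RE, "<timestamp>")
    .replace(LONG_NUMBER_RE, "<id>");
}

/** Reduce a V8-style stack trace to function names only (no paths, no line numbers). */
export function stackSignature(stack: string): string {
  return stack
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.startsWith("at "))
    .map((frame) => /^at (.+?) \(/.exec(frame)?.[1] ?? "<anonymous>")
    .join("|");
}

/** SHA-256 hex digest over the normalized error type, message, and stack. */
export function fingerprint(errorType: string, message: string, stack: string): string {
  const normalized = [errorType, templateMessage(message), stackSignature(stack)].join("\n");
  return createHash("sha256").update(normalized).digest("hex");
}
```

With this shape, two occurrences that differ only in IDs, timestamps, file paths, or line numbers hash to the same fingerprint, which is what allows them to be grouped into one cluster before any embedding or similarity scoring runs.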
### 1.1 Error Fingerprinting & Clustering

- [ ] **1.1.1** Create `modules/ai-diagnostics/types.ts`
  - [ ] `ErrorClusterDoc`: grouped similar errors with signature
  - [ ] `ErrorFingerprint`: normalized stack trace hash
  - [ ] `ClusterAnalysis`: AI-generated pattern description
  - [ ] Zod schemas for all inputs

  _Commit format:_ `git commit -m "feat(ai-diagnostics): add error clustering types [1.1.1]"` → `https://github.com/saravanakumardb1/learning_ai_common_plat/commit/`

- [ ] **1.1.2** Add Cosmos containers to `cosmos-init.ts`
  - [ ] `error_clusters` (pk: `/productId`, TTL: 90 days)
  - [ ] `error_fingerprints` (pk: `/fingerprintHash`, unique index)
  - [ ] `diagnostic_insights` (pk: `/clusterId`, AI-generated analyses)

  _Commit format:_ `git commit -m "feat(ai-diagnostics): add cosmos containers for error clustering [1.1.2]"`

- [ ] **1.1.3** Implement error normalization
  - [ ] Stack trace parsing (remove line numbers, file paths)
  - [ ] Message templating (replace UUIDs, timestamps, user IDs with placeholders)
  - [ ] Fingerprint generation (SHA-256 of normalized error)
  - [ ] Similarity scoring (Levenshtein for near-matches)

  _Commit format:_ `git commit -m "feat(ai-diagnostics): implement error normalization and fingerprinting [1.1.3]"`

### 1.2 Vector Embeddings for Semantic Search

- [ ] **1.2.1** Create embedding pipeline
  - [ ] Azure OpenAI `text-embedding-3-small` integration
  - [ ] Error message + stack trace → 1536-dim vector
  - [ ] Batch embedding job (100 errors at a time)
- [ ] **1.2.2** Cosmos DB vector search setup
  - [ ] Store embeddings in `error_clusters` documents
  - [ ] Cosine similarity query function
  - [ ] Similar error lookup by vector distance
- [ ] **1.2.3** Clustering algorithm
  - [ ] HDBSCAN for density-based clustering
  - [ ] DBSCAN fallback for smaller datasets
  - [ ] Auto-determine cluster count (no manual k)
  - [ ] Re-cluster nightly as new errors arrive

### 1.3 Telemetry Ingestion for Context

- [ ] **1.3.1** Link telemetry to errors
  - [ ] `correlationId` propagation across services
  - [ ] 5-minute window: error → preceding telemetry events
  - [ ] Session state reconstruction (what user was doing)
- [ ] **1.3.2** Enrich error context
  - [ ] Device info (OS version, model, memory)
  - [ ] App state (screen, feature flags, config)
  - [ ] Recent API calls (network trace from diagnostics)
  - [ ] Recent user actions (breadcrumb trail)

**Phase 1 Exit Criteria:**

- [ ] Errors auto-clustered with 90%+ accuracy
- [ ] Vector search returns semantically similar errors
- [ ] 10,000+ historical errors embedded and clustered
- [ ] Correlation pipeline links errors to telemetry context

---

## Phase 2: LLM Analysis Engine (Week 1–2)

### 2.1 Prompt Engineering & Analysis Pipeline

- [ ] **2.1.1** Create analysis prompts
  - [ ] `ROOT_CAUSE_ANALYSIS` prompt template

    ```
    Given this error cluster:
    - Error signature: {fingerprint}
    - Sample stack traces: {samples}
    - Common context: {deviceStats}, {appState}
    - Preceding events: {breadcrumbSummary}
    - Similar resolved issues: {relatedClusters}

    Analyze and provide:
    1. Likely root cause category (config, dependency, logic, resource, external)
    2. Specific hypothesis with reasoning
    3. Evidence confidence (high/medium/low)
    4. Suggested investigation steps
    5. Potential fix direction
    ```

  - [ ] `PATTERN_SUMMARY` prompt for cluster descriptions
  - [ ] `COMPARATIVE_ANALYSIS` for error vs.
    baseline
- [ ] **2.1.2** LLM integration
  - [ ] Azure OpenAI GPT-4o-mini for analysis (cost-effective)
  - [ ] GPT-4o for complex multi-factor analysis
  - [ ] Response JSON schema enforcement
  - [ ] Retry logic with exponential backoff

### 2.2 Insight Generation Service

- [ ] **2.2.1** Create `modules/ai-diagnostics/analyzer.ts`
  - [ ] `analyzeCluster(clusterId)`: full analysis workflow
  - [ ] `generateInsight(errorContext)`: single error analysis
  - [ ] `compareClusters(clusterA, clusterB)`: diff analysis
- [ ] **2.2.2** Analysis workflow
  - [ ] Fetch cluster data + related telemetry
  - [ ] Build LLM context (respect token limits)
  - [ ] Call LLM with structured prompt
  - [ ] Parse and validate response
  - [ ] Store insight in `diagnostic_insights`
- [ ] **2.2.3** Confidence scoring
  - [ ] Evidence count weighting
  - [ ] Similar resolved issue bonus
  - [ ] Recency decay (older patterns = lower confidence)
  - [ ] Multi-model consensus (if available)

### 2.3 Continuous Learning

- [ ] **2.3.1** Feedback loop
  - [ ] Engineer feedback: "Was this insight helpful? 👍/👎"
  - [ ] Resolution tracking (link commits to clusters)
  - [ ] Confidence recalibration based on outcomes
- [ ] **2.3.2** Pattern accumulation
  - [ ] "Known issues" database (manually curated)
  - [ ] Historical fix patterns (what solved similar issues)
  - [ ] Regression detection (old issue reappearing)

**Phase 2 Exit Criteria:**

- [ ] LLM generates root cause hypotheses with evidence
- [ ] Confidence scores align with actual resolution rates
- [ ] Analysis completes in < 5 seconds for typical clusters
- [ ] Feedback loop capturing engineer ratings

---

## Phase 3: Natural Language Query Interface (Week 2)

### 3.1 Query Understanding

- [ ] **3.1.1** Create `modules/ai-diagnostics/query-parser.ts`
  - [ ] Intent classification (root cause, pattern search, comparison, trend)
  - [ ] Entity extraction (product, time range, error type, user segment)
  - [ ] Temporal parsing ("yesterday", "last week", "since v2.1")
  - [ ] Constraint identification ("only iOS", "excluding beta users")
- [ ] **3.1.2** Query patterns
  - [ ] Root cause: _"Why did X happen?"_ → analyze cluster
  - [ ] Pattern search: _"Show me similar crashes"_ → vector search
  - [ ] Comparison: _"Did error rate increase after release?"_ → trend analysis
  - [ ] User impact: _"How many users affected by Y?"_ → aggregation query

### 3.2 Query Execution Engine

- [ ] **3.2.1** Query → data pipeline
  - [ ] Map entities to Cosmos queries
  - [ ] Fetch relevant clusters, telemetry, sessions
  - [ ] Assemble context for response generation
- [ ] **3.2.2** Response generation
  - [ ] Direct answers for simple queries
  - [ ] AI-generated summaries for complex analysis
  - [ ] Data + visualization suggestions
  - [ ] Drill-down links for exploration

### 3.3 REST API Routes

- [ ] **3.3.1** Create `modules/ai-diagnostics/routes.ts`
  - [ ] `POST /ai-diagnostics/query`: natural language question
  - [ ] `GET /ai-diagnostics/clusters/:id/analysis`: pre-computed insight
  - [ ] `POST /ai-diagnostics/clusters/:id/analyze`: trigger fresh analysis
  - [ ] `GET /ai-diagnostics/suggestions`: auto-suggested investigations
  - [ ] `POST /ai-diagnostics/feedback`: submit insight rating

**Phase 3 Exit Criteria:**

- [ ] Natural language queries parse correctly (90%+ intent accuracy)
- [ ] Query → response pipeline < 3 seconds
- [ ] Complex queries return structured answers with evidence
- [ ] API routes tested and documented

---

## Phase 4: Admin Dashboard UI (Week 2–3)

### 4.1 AI Insights Page

- [ ] **4.1.1** Create `/ops/ai-diagnostics/page.tsx`
  - [ ] Smart search bar (natural language input)
  - [ ] Suggested queries based on recent errors
  - [ ] Recent AI-generated insights list
  - [ ] Trending clusters (auto-detected anomalies)
- [ ] **4.1.2** Query results view
  - [ ] AI-generated answer with confidence badge
  - [ ] Supporting evidence cards (cluster stats, sample errors)
  - [ ] Related debug sessions (linked traces)
  - [ ] Timeline visualization of error pattern
  - [ ] "Investigate further" actions

### 4.2 Cluster Detail with AI Analysis

- [ ] **4.2.1** Enhance error cluster detail
  - [ ] AI-generated summary card ("This appears to be...")
  - [ ] Root cause hypothesis with confidence
  - [ ] Evidence breakdown (stack samples, device patterns, API failures)
  - [ ] Suggested fixes from similar resolved issues
  - [ ] "Request deeper analysis" button (GPT-4o)
- [ ] **4.2.2** Interactive investigation
  - [ ] Compare with other clusters ("Show me similar issues")
  - [ ] Filter by context (OS version, app version, feature flags)
  - [ ] View affected user journeys (breadcrumb trails)

### 4.3 Proactive Alerts

- [ ] **4.3.1** Anomaly detection
  - [ ] Auto-detect emerging error clusters
  - [ ] Spike in existing cluster frequency
  - [ ] New error types after releases
- [ ] **4.3.2** AI-generated alerts
  - [ ] Slack/Teams notification with summary
  - [ ] "Investigate in AI Diagnostics" deep link
  - [ ] Auto-started debug session recommendations

**Phase 4 Exit Criteria:**

- [ ] Admin can ask questions and get AI-generated
  answers
- [ ] Cluster detail shows AI analysis with evidence
- [ ] Proactive alerts for emerging issues
- [ ] Full test coverage (UI + API)

---

## Phase 5: Advanced Capabilities (Future)

### 5.1 Multi-Modal Analysis

- [ ] Analyze screenshots from debug sessions for UI issues
- [ ] Voice transcription analysis (for voice app errors)
- [ ] Performance trace visualization with AI annotations

### 5.2 Predictive Diagnostics

- [ ] Pre-crash pattern detection (warn before crash happens)
- [ ] Resource exhaustion prediction (memory, disk, API quotas)
- [ ] Config drift detection ("this setting combination often fails")

### 5.3 Self-Healing Suggestions

- [ ] Auto-generated config recommendations
- [ ] Feature flag rollback suggestions
- [ ] Circuit breaker threshold recommendations

## Implementation Tracking

| Phase | Task                       | Status | Commit  |
| ----- | -------------------------- | ------ | ------- |
| 1.1   | Error clustering types     | ✅     | 4de0126 |
| 1.1   | Cosmos containers          | ✅     | 4de0126 |
| 1.1   | Error normalization        | ✅     | 8cdddd7 |
| 1.2   | Embedding pipeline         | ✅     | 50b7e22 |
| 1.2   | Vector search setup        | ✅     | 6b97476 |
| 1.2   | Clustering algorithm       | ✅     | 8951ab2 |
| 1.3   | Telemetry linking          | ✅     | 1ff0293 |
| 1.3   | Error context enrichment   | ✅     | 1ff0293 |
| 2.1   | Analysis prompts           | ✅     | 97b3ffb |
| 2.1   | LLM integration            | ✅     | 97b3ffb |
| 2.2   | Insight generation service | ✅     | 97b3ffb |
| 2.2   | Analysis workflow          | ✅     | 97b3ffb |
| 2.2   | Confidence scoring         | ✅     | 97b3ffb |
| 2.3   | Feedback loop              | ✅     | 97b3ffb |
| 2.3   | Pattern accumulation       | ✅     | 97b3ffb |
| 3.1   | Query parser               | ✅     | 71cbb57 |
| 3.1   | Query patterns             | ✅     | 71cbb57 |
| 3.2   | Query execution            | ✅     | 71cbb57 |
| 3.2   | Response generation        | ✅     | 71cbb57 |
| 3.3   | REST API routes            | ✅     | 1cba699 |
| 4.1   | AI insights page           | ✅     | 460ed6d |
| 4.1   | Query results view         | ✅     | 460ed6d |
| 4.2   | Cluster detail             | ✅     | 460ed6d |
| 4.2   | Interactive investigation  | ✅     | eab8543 |
| 4.3   | Proactive alerts           | ✅     | eab8543 |

**Legend:** ⬜ Not started | 🟡 In progress | ✅ Complete | ⏸️ Deferred

---

## Quick Reference for Implementing Agent

**📋 Full Roadmap:** `/Users/sd9235/code/mygh/learning_ai_common_plat/docs/roadmaps/AI_DIAGNOSTIC_ASSISTANT_ROADMAP.md`

**Key Files to Modify/Create:**

```
services/platform-service/
├── src/
│   ├── modules/ai-diagnostics/
│   │   ├── types.ts                 # [1.1.1] Error clustering types
│   │   ├── repository.ts            # [1.2] Data access layer
│   │   ├── analyzer.ts              # [2.2] LLM analysis engine
│   │   ├── query-parser.ts          # [3.1] NL query understanding
│   │   ├── query-executor.ts        # [3.2] Query execution
│   │   ├── routes.ts                # [3.3] REST API
│   │   └── ai-diagnostics.test.ts   # Tests
│   ├── lib/
│   │   ├── cosmos-init.ts           # [1.1.2] Add containers
│   │   ├── embedding-client.ts      # [1.2.1] Azure OpenAI embeddings
│   │   └── pii-redaction.ts         # Reuse existing
│   └── server.ts                    # [3.3] Register routes

dashboards/admin-web/
├── src/
│   ├── app/(dashboard)/
│   │   ├── ai-diagnostics/
│   │   │   ├── page.tsx             # [4.1] Main insights page
│   │   │   └── [id]/
│   │   │       └── page.tsx         # [4.2] Cluster detail
│   ├── lib/
│   │   └── ai-diagnostics-client.ts # API client
│   └── components/
│       └── ai-diagnostics/          # Reusable components
```

**Commit Message Format:**

```
feat(ai-diagnostics): <description> [<task-id>]
```

**Example:**

```bash
git add services/platform-service/src/modules/ai-diagnostics/
git commit -m "feat(ai-diagnostics): add error clustering types and cosmos containers [1.1.1-1.1.2]"
```

**Testing Requirements:**

- Unit tests: 20+ Vitest tests for clustering, embeddings, LLM responses
- Integration tests: End-to-end query → analysis pipeline

**Dependencies:**

- Telemetry module (error events)
- Azure OpenAI (embeddings + GPT-4o)
- Existing diagnostics module (optional linking)

---

## Appendix A: Data Models

### ErrorClusterDoc

```typescript
interface ErrorClusterDoc {
  id: string; // ec_
  productId: string; // partition key
  fingerprintHash: string; // SHA-256 of normalized error

  // Cluster metadata
  firstSeenAt: string; // ISO 8601
  lastSeenAt: string;
  occurrenceCount: number; // Total occurrences
  uniqueUsers: number; // Affected user count

  // Error signature
  errorType: string; // Exception class/name
  messageTemplate: string; // Normalized message with placeholders
  stackSignature: string; // Normalized stack frames

  // Vector embedding for semantic search
  embedding: number[]; // 1536-dim from text-embedding-3-small
  embeddingVersion: string; // Model version for re-embedding

  // Context patterns (auto-extracted)
  commonContext: {
    osVersions: Array<{ version: string; count: number }>;
    appVersions: Array<{ version: string; count: number }>;
    deviceModels: Array<{ model: string; count: number }>;
    screenContexts: Array<{ screen: string; count: number }>;
  };

  // Related data
  relatedClusterIds: string[]; // Similar clusters (vector similarity)
  mergedIntoClusterId?: string; // If deduplicated

  // Resolution tracking
  status: 'active' | 'investigating' | 'resolved' | 'ignored';
  resolvedAt?: string;
  resolutionCommit?: string; // Link to fix

  // Timestamps
  createdAt: string;
  updatedAt: string;
  ttl: number; // 90 days
}
```

### DiagnosticInsightDoc

```typescript
interface DiagnosticInsightDoc {
  id: string; // di_
  clusterId: string; // partition key (with productId)
  productId: string;

  // AI-generated analysis
  analysisType: 'root_cause' | 'pattern' | 'comparison' | 'trend';
  generatedAt: string;

  // LLM output
  rootCauseCategory: 'config' | 'dependency' | 'logic' | 'resource' | 'external' | 'unknown';
  hypothesis: string; // Natural language explanation
  reasoning: string; // Why LLM thinks this
  confidence: 'high' | 'medium' | 'low';
  confidenceScore: number; // 0.0–1.0

  // Evidence
  evidence: Array<{
    type:
      | 'stack_trace'
      | 'telemetry_pattern'
      | 'device_correlation'
      | 'api_failure'
      | 'similar_issue';
    description: string;
    strength: 'strong' | 'moderate' | 'weak';
    data: Record<string, unknown>; // Free-form payload; shape depends on evidence type
  }>;

  // Suggested actions
  suggestedInvestigation: string[];
  potentialFixDirection?: string;
  similarResolvedIssues?: Array<{
    clusterId: string;
    resolution: string;
    confidence: number;
  }>;

  // Feedback
  feedbackStats: {
    helpful: number;
    notHelpful: number;
    engineerNotes: string[];
  };

  // LLM metadata
  modelUsed: string; // gpt-4o, gpt-4o-mini
  promptTokens: number;
  completionTokens: number;

  createdAt: string;
  ttl: number; // 90 days
}
```

### NaturalLanguageQueryDoc

```typescript
interface NaturalLanguageQueryDoc {
  id: string; // nq_
  userId: string; // Admin who asked
  productId?: string; // Optional filter

  // Query
  rawQuery: string; // "Why did iOS keyboard crash yesterday?"
  parsedIntent: 'root_cause' | 'pattern_search' | 'comparison' | 'trend' | 'impact';
  extractedEntities: {
    products?: string[];
    timeRange?: { start: string; end: string };
    errorTypes?: string[];
    platforms?: string[];
    userSegments?: string[];
  };

  // Execution
  executedQuery: string; // Translated Cosmos query
  dataSources: string[]; // Clusters, telemetry, sessions accessed
  executionTimeMs: number;

  // Response
  aiResponse: string; // Generated answer
  confidence: number; // Overall confidence
  supportingData: Array<{
    type: 'cluster' | 'telemetry' | 'session';
    id: string;
    relevanceScore: number;
  }>;

  // Feedback
  userRating?: 'helpful' | 'not_helpful';
  userComment?: string;

  createdAt: string;
  ttl: number; // 30 days
}
```

---

## Appendix B: API Reference

| Method | Endpoint                                | Auth  | Description                             |
| ------ | --------------------------------------- | ----- | --------------------------------------- |
| POST   | `/ai-diagnostics/query`                 | Admin | Natural language diagnostic query       |
| GET    | `/ai-diagnostics/clusters`              | Admin | List error clusters (with AI summaries) |
| GET    | `/ai-diagnostics/clusters/:id`          | Admin | Cluster detail with AI analysis         |
| POST   | `/ai-diagnostics/clusters/:id/analyze`  | Admin | Trigger fresh LLM analysis              |
| GET    | `/ai-diagnostics/clusters/:id/analysis` | Admin | Get pre-computed insight                |
| GET    | `/ai-diagnostics/suggestions`           | Admin | AI-suggested investigations             |
| POST   | `/ai-diagnostics/feedback`              | Admin | Rate insight helpfulness                |
| POST   | `/ai-diagnostics/search`                | Admin | Semantic search across errors           |

---

## Appendix C: Integration Points

### With Telemetry Module

- Error events auto-create/update clusters
- Telemetry context enriches error analysis
- Correlation IDs link errors to user journeys

### With Diagnostics Module

- Debug sessions linked to error clusters
- Screenshots from sessions aid visual analysis
- Network traces provide API failure context

### With Event Bus

| Event                           | Action                                                    |
| ------------------------------- | --------------------------------------------------------- |
| `telemetry.error.ingested`      | Update/create cluster, trigger re-analysis if new pattern |
| `diagnostics.session.completed` | Link session to related clusters, analyze captured logs   |
| `diagnostics.ingest.fatal`      | High-priority cluster analysis, alert if novel pattern    |

---

## Appendix D: Cost Estimation

| Component                | Monthly Cost (est.)             |
| ------------------------ | ------------------------------- |
| Azure OpenAI embeddings  | $50–100 (10K errors/day)        |
| GPT-4o-mini analysis     | $100–200 (1K analyses/day)      |
| GPT-4o deep analysis     | $50–100 (100 deep analyses/day) |
| Cosmos DB vector storage | $20–50                          |
| **Total**                | **$220–450/month**              |

Optimization:

- Cache frequent cluster analyses (24hr TTL)
- Use GPT-4o-mini for 90% of queries
- Batch embedding jobs during off-peak

---

## Current Status

- [ ] **Design complete** - Target: 2026-03-10
- [ ] **Phase 1: Data Pipeline** - Not started
- [ ] **Phase 2: LLM Engine** - Not started
- [ ] **Phase 3: Query Interface** - Not started
- [ ] **Phase 4: Admin UI** - Not started
- [ ] **Phase 5: Advanced Capabilities** - Future

**Estimated Timeline:** 2–3 weeks (Phases 1–4)

**Dependencies:**

- Telemetry module (must be collecting errors)
- Diagnostics module (optional, for rich context)
- Azure OpenAI deployment (embedding + GPT-4o access)

---

_Last Updated: 2026-03-03_