

AI Diagnostic Assistant — Implementation Roadmap

Module: platform-service/src/modules/ai-diagnostics/
Admin UI: /ops/ai-diagnostics/
Target: LLM-powered root cause analysis from telemetry + debug sessions
Estimated Effort: 2-3 weeks
Status: Complete (Phases 1-4)


Executive Summary

This roadmap delivers an AI-powered diagnostic assistant that analyzes error patterns, debug session data, and telemetry to automatically suggest root causes—like having a senior engineer on-call 24/7. Engineers can ask natural language questions like "Why did the iOS keyboard crash yesterday?" and receive AI-generated hypotheses with supporting evidence.

Key Differentiators vs. Manual Debugging

| Feature | Manual Debugging | AI Diagnostic Assistant |
| --- | --- | --- |
| Query | SQL + log grep | Natural language |
| Pattern Detection | Hours of manual correlation | AI finds hidden patterns |
| Context Assembly | Check 5+ systems manually | Auto-assembles timeline |
| Hypothesis | Engineer intuition | LLM-generated + evidence |
| Learning | Per-engineer experience | Accumulates across all sessions |

Phase 1: Data Pipeline & Embeddings (Week 1)

Goal: Extract, normalize, and embed error data for semantic search and clustering.

1.1 Error Fingerprinting & Clustering

  • 1.1.1 Create modules/ai-diagnostics/types.ts

    • ErrorClusterDoc — grouped similar errors with signature
    • ErrorFingerprint — normalized stack trace hash
    • ClusterAnalysis — AI-generated pattern description
    • Zod schemas for all inputs

    Commit format: git commit -m "feat(ai-diagnostics): add error clustering types [1.1.1]"
    Commit URL format: https://github.com/saravanakumardb1/learning_ai_common_plat/commit/<hash>

  • 1.1.2 Add Cosmos containers to cosmos-init.ts

    • error_clusters (pk: /productId, TTL: 90 days)
    • error_fingerprints (pk: /fingerprintHash, unique index)
    • diagnostic_insights (pk: /clusterId, AI-generated analyses)

    Commit format: git commit -m "feat(ai-diagnostics): add cosmos containers for error clustering [1.1.2]"

  • 1.1.3 Implement error normalization

    • Stack trace parsing (remove line numbers, file paths)
    • Message templating (replace UUIDs, timestamps, user IDs with placeholders)
    • Fingerprint generation (SHA-256 of normalized error)
    • Similarity scoring (Levenshtein for near-matches)

    Commit format: git commit -m "feat(ai-diagnostics): implement error normalization and fingerprinting [1.1.3]"
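The normalization steps in 1.1.3 can be sketched as follows. This is a minimal illustration, not the module's actual code: the function names, the placeholder tokens (`<uuid>`, `<timestamp>`, `<num>`), and the assumption of V8-style stack frames (`at fn (path:line:col)`) are all hypothetical choices made for the example.

```typescript
import { createHash } from "node:crypto";

// Replace volatile tokens with placeholders so equivalent errors share one template.
function templateMessage(message: string): string {
  return message
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, "<uuid>")
    .replace(/\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?/g, "<timestamp>")
    .replace(/\b\d+\b/g, "<num>");
}

// Keep only stack frames and drop file paths, line numbers, and column numbers.
function normalizeStack(stack: string): string {
  return stack
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.startsWith("at "))
    .map((line) => {
      const named = /^at (.+?) \(.*\)$/.exec(line);
      return named ? `at ${named[1]}` : "at <anonymous>";
    })
    .join("\n");
}

// SHA-256 over the normalized error, returned as a 64-character hex string.
function fingerprint(errorType: string, message: string, stack: string): string {
  const normalized = [errorType, templateMessage(message), normalizeStack(stack)].join("\n");
  return createHash("sha256").update(normalized, "utf8").digest("hex");
}

// Classic two-row Levenshtein edit distance.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]; 1 means identical strings.
function similarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}
```

Two errors that differ only in user IDs or line numbers should then hash to the same fingerprint, while `similarity` on the message templates can catch near-matches that the exact hash misses.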

1.2 Embeddings, Vector Search & Clustering

  • 1.2.1 Create embedding pipeline
    • Azure OpenAI text-embedding-3-small integration
    • Error message + stack trace → 1536-dim vector
    • Batch embedding job (100 errors at a time)
  • 1.2.2 Cosmos DB vector search setup
    • Store embeddings in error_clusters documents
    • Cosine similarity query function
    • Similar error lookup by vector distance
  • 1.2.3 Clustering algorithm
    • HDBSCAN for density-based clustering
    • DBSCAN fallback for smaller datasets
    • Auto-determine cluster count (no manual k)
    • Re-cluster nightly as new errors arrive
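As a rough illustration of the batching and similarity lookup described in 1.2.1 and 1.2.2, the sketch below splits errors into batches of 100 and ranks candidates by cosine similarity in memory. In the real pipeline the lookup would presumably run inside the database's vector search rather than in application code; the helper names and the `Embedded` shape are assumptions for this example.

```typescript
// Split items into fixed-size batches (the roadmap embeds 100 errors at a time).
function toBatches<T>(items: T[], batchSize = 100): T[][] {
  if (batchSize < 1) throw new RangeError("batchSize must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Cosine similarity of two equal-length vectors; returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical minimal shape: a cluster id plus its stored embedding.
interface Embedded {
  id: string;
  embedding: number[];
}

// Return the topK candidates most similar to the query vector, best first.
function nearest(query: number[], candidates: Embedded[], topK = 5): Array<{ id: string; score: number }> {
  return candidates
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

The in-memory version is mainly useful for unit tests and for sanity-checking results returned by the database.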

1.3 Telemetry Ingestion for Context

  • 1.3.1 Link telemetry to errors
    • correlationId propagation across services
    • 5-minute window: error → preceding telemetry events
    • Session state reconstruction (what user was doing)
  • 1.3.2 Enrich error context
    • Device info (OS version, model, memory)
    • App state (screen, feature flags, config)
    • Recent API calls (network trace from diagnostics)
    • Recent user actions (breadcrumb trail)
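The 5-minute correlation window from 1.3.1 might look like the sketch below. The `TelemetryEvent` and `ErrorOccurrence` shapes are assumptions (the roadmap does not define them); only the `correlationId` match and the 5-minute look-back come from the text above.

```typescript
// Hypothetical minimal event shapes; timestamps are ISO 8601 strings.
interface TelemetryEvent {
  correlationId: string;
  timestamp: string;
  name: string;
}

interface ErrorOccurrence {
  correlationId: string;
  timestamp: string;
}

const WINDOW_MS = 5 * 60 * 1000;

// Events sharing the error's correlationId that occurred at most windowMs
// before the error, returned oldest first to form a breadcrumb trail.
function precedingEvents(
  error: ErrorOccurrence,
  events: TelemetryEvent[],
  windowMs: number = WINDOW_MS,
): TelemetryEvent[] {
  const errorAt = Date.parse(error.timestamp);
  return events
    .filter((e) => e.correlationId === error.correlationId)
    .filter((e) => {
      const at = Date.parse(e.timestamp);
      return at <= errorAt && errorAt - at <= windowMs;
    })
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
}
```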

Phase 1 Exit Criteria:

  • Errors auto-clustered with 90%+ accuracy
  • Vector search returns semantically similar errors
  • 10,000+ historical errors embedded and clustered
  • Correlation pipeline links errors to telemetry context

Phase 2: LLM Analysis Engine (Week 1-2)

2.1 Prompt Engineering & Analysis Pipeline

  • 2.1.1 Create analysis prompts

    • ROOT_CAUSE_ANALYSIS prompt template

      Given this error cluster:
      - Error signature: {fingerprint}
      - Sample stack traces: {samples}
      - Common context: {deviceStats}, {appState}
      - Preceding events: {breadcrumbSummary}
      - Similar resolved issues: {relatedClusters}
      
      Analyze and provide:
      1. Likely root cause category (config, dependency, logic, resource, external)
      2. Specific hypothesis with reasoning
      3. Evidence confidence (high/medium/low)
      4. Suggested investigation steps
      5. Potential fix direction
      
    • PATTERN_SUMMARY prompt for cluster descriptions

    • COMPARATIVE_ANALYSIS for error vs. baseline

  • 2.1.2 LLM integration

    • Azure OpenAI GPT-4o-mini for analysis (cost-effective)
    • GPT-4o for complex multi-factor analysis
    • Response JSON schema enforcement
    • Retry logic with exponential backoff
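The pieces of 2.1 (filling the `{placeholder}` slots of a prompt template, enforcing a response shape, and retrying with exponential backoff) can be sketched without committing to a particular SDK. Everything here is illustrative: the function names, the response fields, and the 500 ms base / 8 s cap are assumptions, and the actual LLM call is left to the caller.

```typescript
// Fill {name} placeholders; fail loudly if a variable is missing.
function renderPrompt(template: string, vars: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (_match, key: string) => {
    if (!(key in vars)) throw new Error(`missing prompt variable: ${key}`);
    return vars[key];
  });
}

// Assumed response shape, mirroring the five items the prompt asks for.
interface RootCauseResponse {
  rootCauseCategory: "config" | "dependency" | "logic" | "resource" | "external";
  hypothesis: string;
  confidence: "high" | "medium" | "low";
  suggestedInvestigation: string[];
}

const CATEGORIES: string[] = ["config", "dependency", "logic", "resource", "external"];
const CONFIDENCES: string[] = ["high", "medium", "low"];

// Parse raw model text and reject anything that does not match the expected shape.
function parseRootCauseResponse(raw: string): RootCauseResponse {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) throw new Error("response is not an object");
  const r = data as Record<string, unknown>;
  const category = r.rootCauseCategory;
  const hypothesis = r.hypothesis;
  const confidence = r.confidence;
  const steps = r.suggestedInvestigation;
  if (typeof category !== "string" || !CATEGORIES.includes(category)) throw new Error("bad rootCauseCategory");
  if (typeof hypothesis !== "string") throw new Error("bad hypothesis");
  if (typeof confidence !== "string" || !CONFIDENCES.includes(confidence)) throw new Error("bad confidence");
  if (!Array.isArray(steps) || !steps.every((s) => typeof s === "string")) throw new Error("bad suggestedInvestigation");
  return {
    rootCauseCategory: category as RootCauseResponse["rootCauseCategory"],
    hypothesis,
    confidence: confidence as RootCauseResponse["confidence"],
    suggestedInvestigation: steps as string[],
  };
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Run fn up to maxAttempts times, sleeping with exponential backoff between failures.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await new Promise<void>((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
      }
    }
  }
  throw lastError;
}
```

A caller would wrap its model request in `withRetry` and pass the returned text through `parseRootCauseResponse`, so malformed output is rejected before anything is stored.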

2.2 Insight Generation Service

  • 2.2.1 Create modules/ai-diagnostics/analyzer.ts
    • analyzeCluster(clusterId) — full analysis workflow
    • generateInsight(errorContext) — single error analysis
    • compareClusters(clusterA, clusterB) — diff analysis
  • 2.2.2 Analysis workflow
    • Fetch cluster data + related telemetry
    • Build LLM context (respect token limits)
    • Call LLM with structured prompt
    • Parse and validate response
    • Store insight in diagnostic_insights
  • 2.2.3 Confidence scoring
    • Evidence count weighting
    • Similar resolved issue bonus
    • Recency decay (older patterns = lower confidence)
    • Multi-model consensus (if available)
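One way the factors in 2.2.3 could combine into a single score is sketched below. The roadmap does not specify a formula, so the saturating evidence curve, the 0.15 bonus, the 30-day half-life, and the label thresholds are all assumptions chosen for illustration; multi-model consensus is left out.

```typescript
interface ConfidenceInputs {
  evidenceCount: number;       // number of supporting evidence items
  similarResolvedCount: number; // similar clusters that were already resolved
  ageDays: number;             // age of the underlying pattern in days
}

// Combine evidence weighting, a resolved-issue bonus, and recency decay into [0, 1].
function confidenceScore(input: ConfidenceInputs, halfLifeDays = 30): number {
  const evidence = 1 - Math.exp(-input.evidenceCount / 3); // saturates as evidence grows
  const bonus = input.similarResolvedCount > 0 ? 0.15 : 0; // assumed flat bonus
  const decay = Math.pow(0.5, input.ageDays / halfLifeDays); // halves every halfLifeDays
  return Math.min(1, Math.max(0, (evidence + bonus) * decay));
}

// Map the numeric score to the high/medium/low label stored with each insight.
function confidenceLabel(score: number): "high" | "medium" | "low" {
  if (score >= 0.75) return "high";
  if (score >= 0.4) return "medium";
  return "low";
}
```

Whatever formula is used, the Phase 2 exit criterion implies it should be recalibrated against actual resolution outcomes rather than fixed up front.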

2.3 Continuous Learning

  • 2.3.1 Feedback loop
    • Engineer feedback: "Was this insight helpful? 👍/👎"
    • Resolution tracking (link commits to clusters)
    • Confidence recalibration based on outcomes
  • 2.3.2 Pattern accumulation
    • "Known issues" database (manually curated)
    • Historical fix patterns (what solved similar issues)
    • Regression detection (old issue reappearing)

Phase 2 Exit Criteria:

  • LLM generates root cause hypotheses with evidence
  • Confidence scores align with actual resolution rates
  • Analysis completes in < 5 seconds for typical clusters
  • Feedback loop capturing engineer ratings

Phase 3: Natural Language Query Interface (Week 2)

3.1 Query Understanding

  • 3.1.1 Create modules/ai-diagnostics/query-parser.ts
    • Intent classification (root cause, pattern search, comparison, trend)
    • Entity extraction (product, time range, error type, user segment)
    • Temporal parsing ("yesterday", "last week", "since v2.1")
    • Constraint identification ("only iOS", "excluding beta users")
  • 3.1.2 Query patterns
    • Root cause: "Why did X happen?" → analyze cluster
    • Pattern search: "Show me similar crashes" → vector search
    • Comparison: "Did error rate increase after release?" → trend analysis
    • User impact: "How many users affected by Y?" → aggregation query
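To make the query patterns above concrete, here is a deliberately simple rule-based sketch of intent classification and temporal parsing. The real parser may well use an LLM instead; the keyword rules, the default intent, and the reading of "last week" as the trailing 7 days are assumptions for this example.

```typescript
type Intent = "root_cause" | "pattern_search" | "comparison" | "trend" | "impact";

// Keyword-based intent classification, checked in priority order.
function classifyIntent(query: string): Intent {
  const q = query.toLowerCase().trim();
  if (/^why\b|\broot cause\b/.test(q)) return "root_cause";
  if (/\bhow many\b|\baffected\b/.test(q)) return "impact";
  if (/\bsimilar\b/.test(q)) return "pattern_search";
  if (/\bincrease|\bdecrease|\bafter release\b/.test(q)) return "comparison";
  if (/\btrend|\bover time\b/.test(q)) return "trend";
  return "root_cause"; // assumed default when no rule matches
}

// Resolve a relative time phrase into an ISO 8601 range (UTC), or undefined.
function parseTimeRange(query: string, now: Date): { start: string; end: string } | undefined {
  const q = query.toLowerCase();
  const dayMs = 24 * 60 * 60 * 1000;
  const startOfToday = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
  if (q.includes("yesterday")) {
    return {
      start: new Date(startOfToday - dayMs).toISOString(),
      end: new Date(startOfToday).toISOString(),
    };
  }
  if (q.includes("last week")) {
    return { start: new Date(now.getTime() - 7 * dayMs).toISOString(), end: now.toISOString() };
  }
  return undefined;
}

// Extract platform constraints such as "only iOS".
function extractPlatforms(query: string): string[] {
  const platforms: string[] = [];
  if (/\bios\b/i.test(query)) platforms.push("iOS");
  if (/\bandroid\b/i.test(query)) platforms.push("Android");
  return platforms;
}
```

For "Why did the iOS keyboard crash yesterday?" this yields a `root_cause` intent, an `iOS` platform constraint, and a one-day UTC range ending at the start of the current day.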

3.2 Query Execution Engine

  • 3.2.1 Query → data pipeline
    • Map entities to Cosmos queries
    • Fetch relevant clusters, telemetry, sessions
    • Assemble context for response generation
  • 3.2.2 Response generation
    • Direct answers for simple queries
    • AI-generated summaries for complex analysis
    • Data + visualization suggestions
    • Drill-down links for exploration

3.3 REST API Routes

  • 3.3.1 Create modules/ai-diagnostics/routes.ts
    • POST /ai-diagnostics/query — natural language question
    • GET /ai-diagnostics/clusters/:id/analysis — pre-computed insight
    • POST /ai-diagnostics/clusters/:id/analyze — trigger fresh analysis
    • GET /ai-diagnostics/suggestions — auto-suggested investigations
    • POST /ai-diagnostics/feedback — submit insight rating
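The sketch below shows, independent of any web framework, how these routes could be matched and how the body of `POST /ai-diagnostics/query` could be validated. The roadmap does not define the request body, so the `query` / `productId` fields and the 500-character limit are assumptions; the actual `routes.ts` would use whatever router `server.ts` already registers.

```typescript
// Route table taken from the list above.
const ROUTES = [
  { method: "POST", path: "/ai-diagnostics/query" },
  { method: "GET", path: "/ai-diagnostics/clusters/:id/analysis" },
  { method: "POST", path: "/ai-diagnostics/clusters/:id/analyze" },
  { method: "GET", path: "/ai-diagnostics/suggestions" },
  { method: "POST", path: "/ai-diagnostics/feedback" },
];

// Match a request against the table, extracting ":param" path segments.
function matchRoute(
  method: string,
  path: string,
): { route: string; params: Record<string, string> } | undefined {
  for (const r of ROUTES) {
    if (r.method !== method) continue;
    const pattern = r.path.split("/");
    const actual = path.split("/");
    if (pattern.length !== actual.length) continue;
    const params: Record<string, string> = {};
    let ok = true;
    for (let i = 0; i < pattern.length; i++) {
      if (pattern[i].startsWith(":")) {
        if (actual[i] === "") { ok = false; break; }
        params[pattern[i].slice(1)] = actual[i];
      } else if (pattern[i] !== actual[i]) {
        ok = false;
        break;
      }
    }
    if (ok) return { route: r.path, params };
  }
  return undefined;
}

// Assumed request body for POST /ai-diagnostics/query.
interface DiagnosticQueryBody {
  query: string;
  productId?: string;
}

const MAX_QUERY_LENGTH = 500; // assumed limit

// Validate untrusted JSON before it reaches the query parser.
function validateQueryBody(input: unknown): DiagnosticQueryBody {
  if (typeof input !== "object" || input === null) throw new Error("body must be a JSON object");
  const body = input as Record<string, unknown>;
  const query = body.query;
  const productId = body.productId;
  if (typeof query !== "string" || query.trim().length === 0) throw new Error("query must be a non-empty string");
  if (query.length > MAX_QUERY_LENGTH) throw new Error("query is too long");
  if (productId !== undefined && typeof productId !== "string") throw new Error("productId must be a string");
  return productId === undefined ? { query: query.trim() } : { query: query.trim(), productId };
}
```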

Phase 3 Exit Criteria:

  • Natural language queries parse correctly (90%+ intent accuracy)
  • Query → response pipeline < 3 seconds
  • Complex queries return structured answers with evidence
  • API routes tested and documented

Phase 4: Admin Dashboard UI (Week 2-3)

4.1 AI Insights Page

  • 4.1.1 Create /ops/ai-diagnostics/page.tsx
    • Smart search bar (natural language input)
    • Suggested queries based on recent errors
    • Recent AI-generated insights list
    • Trending clusters (auto-detected anomalies)
  • 4.1.2 Query results view
    • AI-generated answer with confidence badge
    • Supporting evidence cards (cluster stats, sample errors)
    • Related debug sessions (linked traces)
    • Timeline visualization of error pattern
    • "Investigate further" actions

4.2 Cluster Detail with AI Analysis

  • 4.2.1 Enhance error cluster detail
    • AI-generated summary card ("This appears to be...")
    • Root cause hypothesis with confidence
    • Evidence breakdown (stack samples, device patterns, API failures)
    • Suggested fixes from similar resolved issues
    • "Request deeper analysis" button (GPT-4o)
  • 4.2.2 Interactive investigation
    • Compare with other clusters ("Show me similar issues")
    • Filter by context (OS version, app version, feature flags)
    • View affected user journeys (breadcrumb trails)

4.3 Proactive Alerts

  • 4.3.1 Anomaly detection
    • Auto-detect emerging error clusters
    • Spike in existing cluster frequency
    • New error types after releases
  • 4.3.2 AI-generated alerts
    • Slack/Teams notification with summary
    • "Investigate in AI Diagnostics" deep link
    • Auto-started debug session recommendations
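A frequency spike in an existing cluster (4.3.1) could be flagged with something as simple as a z-score over recent counts. This is only one possible approach, not necessarily what the module uses; the function name, the per-interval counts input, and the threshold of 3 are assumptions.

```typescript
// Flag `current` as a spike when it sits more than zThreshold sample standard
// deviations above the mean of the recent per-interval occurrence counts.
function isSpike(recentCounts: number[], current: number, zThreshold = 3): boolean {
  if (recentCounts.length < 2) return false; // not enough data for a baseline
  const mean = recentCounts.reduce((sum, x) => sum + x, 0) / recentCounts.length;
  const variance =
    recentCounts.reduce((sum, x) => sum + (x - mean) * (x - mean), 0) / (recentCounts.length - 1);
  const std = Math.sqrt(variance);
  if (std === 0) return current > mean; // flat baseline: any increase counts
  return (current - mean) / std > zThreshold;
}
```

For example, against a baseline of `[10, 12, 11, 9, 10]` occurrences per hour, a reading of 40 is flagged while a reading of 11 is not.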

Phase 4 Exit Criteria:

  • Admin can ask questions and get AI-generated answers
  • Cluster detail shows AI analysis with evidence
  • Proactive alerts for emerging issues
  • Full test coverage (UI + API)

Phase 5: Advanced Capabilities (Future)

5.1 Multi-Modal Analysis

  • Analyze screenshots from debug sessions for UI issues
  • Voice transcription analysis (for voice app errors)
  • Performance trace visualization with AI annotations

5.2 Predictive Diagnostics

  • Pre-crash pattern detection (warn before crash happens)
  • Resource exhaustion prediction (memory, disk, API quotas)
  • Config drift detection ("this setting combination often fails")

5.3 Self-Healing Suggestions

  • Auto-generated config recommendations
  • Feature flag rollback suggestions
  • Circuit breaker threshold recommendations

Implementation Tracking

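| Phase | Task | Status | Commit |
| --- | --- | --- | --- |
| 1.1 | Error clustering types | Complete | 4de0126 |
| 1.1 | Cosmos containers | Complete | 4de0126 |
| 1.1 | Error normalization | Complete | 8cdddd7 |
| 1.2 | Embedding pipeline | Complete | 50b7e22 |
| 1.2 | Vector search setup | Complete | 6b97476 |
| 1.2 | Clustering algorithm | Complete | 8951ab2 |
| 1.3 | Telemetry linking | Complete | 1ff0293 |
| 1.3 | Error context enrichment | Complete | 1ff0293 |
| 2.1 | Analysis prompts | Complete | 97b3ffb |
| 2.1 | LLM integration | Complete | 97b3ffb |
| 2.2 | Insight generation service | Complete | 97b3ffb |
| 2.2 | Analysis workflow | Complete | 97b3ffb |
| 2.2 | Confidence scoring | Complete | 97b3ffb |
| 2.3 | Feedback loop | Complete | 97b3ffb |
| 2.3 | Pattern accumulation | Complete | 97b3ffb |
| 3.1 | Query parser | Complete | 71cbb57 |
| 3.1 | Query patterns | Complete | 71cbb57 |
| 3.2 | Query execution | Complete | 71cbb57 |
| 3.2 | Response generation | Complete | 71cbb57 |
| 3.3 | REST API routes | Complete | 1cba699 |
| 4.1 | AI insights page | Complete | 460ed6d |
| 4.1 | Query results view | Complete | 460ed6d |
| 4.2 | Cluster detail | Complete | 460ed6d |
| 4.2 | Interactive investigation | Complete | eab8543 |
| 4.3 | Proactive alerts | Complete | eab8543 |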

Legend: Not started | 🟡 In progress | Complete | ⏸️ Deferred


Quick Reference for Implementing Agent

📋 Full Roadmap: /Users/sd9235/code/mygh/learning_ai_common_plat/docs/roadmaps/AI_DIAGNOSTIC_ASSISTANT_ROADMAP.md

Key Files to Modify/Create:

services/platform-service/
├── src/
│   ├── modules/ai-diagnostics/
│   │   ├── types.ts              # [1.1.1] Error clustering types
│   │   ├── repository.ts         # [1.2] Data access layer
│   │   ├── analyzer.ts           # [2.2] LLM analysis engine
│   │   ├── query-parser.ts       # [3.1] NL query understanding
│   │   ├── query-executor.ts     # [3.2] Query execution
│   │   ├── routes.ts             # [3.3] REST API
│   │   └── ai-diagnostics.test.ts # Tests
│   ├── lib/
│   │   ├── cosmos-init.ts        # [1.1.2] Add containers
│   │   ├── embedding-client.ts   # [1.2.1] Azure OpenAI embeddings
│   │   └── pii-redaction.ts      # Reuse existing
│   └── server.ts                 # [3.3] Register routes
dashboards/admin-web/
├── src/
│   ├── app/(dashboard)/
│   │   ├── ai-diagnostics/
│   │   │   ├── page.tsx          # [4.1] Main insights page
│   │   │   └── [id]/
│   │   │       └── page.tsx      # [4.2] Cluster detail
│   ├── lib/
│   │   └── ai-diagnostics-client.ts # API client
│   └── components/
│       └── ai-diagnostics/       # Reusable components

Commit Message Format:

feat(ai-diagnostics): <description> [<task.code>]

Example:

git add services/platform-service/src/modules/ai-diagnostics/
git commit -m "feat(ai-diagnostics): add error clustering types and cosmos containers [1.1.1-1.1.2]"

Testing Requirements:

  • Unit tests: 20+ Vitest tests for clustering, embeddings, LLM responses
  • Integration tests: End-to-end query → analysis pipeline

Dependencies:

  • Telemetry module (error events)
  • Azure OpenAI (embeddings + GPT-4o)
  • Existing diagnostics module (optional linking)

Appendix A: Data Models

ErrorClusterDoc

interface ErrorClusterDoc {
  id: string; // ec_<uuid>
  productId: string; // partition key
  fingerprintHash: string; // SHA-256 of normalized error

  // Cluster metadata
  firstSeenAt: string; // ISO 8601
  lastSeenAt: string;
  occurrenceCount: number; // Total occurrences
  uniqueUsers: number; // Affected user count

  // Error signature
  errorType: string; // Exception class/name
  messageTemplate: string; // Normalized message with placeholders
  stackSignature: string; // Normalized stack frames

  // Vector embedding for semantic search
  embedding: number[]; // 1536-dim from text-embedding-3-small
  embeddingVersion: string; // Model version for re-embedding

  // Context patterns (auto-extracted)
  commonContext: {
    osVersions: Array<{ version: string; count: number }>;
    appVersions: Array<{ version: string; count: number }>;
    deviceModels: Array<{ model: string; count: number }>;
    screenContexts: Array<{ screen: string; count: number }>;
  };

  // Related data
  relatedClusterIds: string[]; // Similar clusters (vector similarity)
  mergedIntoClusterId?: string; // If deduplicated

  // Resolution tracking
  status: 'active' | 'investigating' | 'resolved' | 'ignored';
  resolvedAt?: string;
  resolutionCommit?: string; // Link to fix

  // Timestamps
  createdAt: string;
  updatedAt: string;
  ttl: number; // 90 days
}
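
As an illustration of how the `commonContext` histograms might be populated, the sketch below tallies context values across a cluster's errors and keeps the most frequent ones. The `EnrichedError` shape, the helper names, and the top-5 cutoff are assumptions, not part of the roadmap.

```typescript
// Hypothetical per-error context, as produced by the enrichment step.
interface EnrichedError {
  osVersion: string;
  appVersion: string;
  deviceModel: string;
  screen: string;
}

// Count occurrences of each value and keep the topN, most frequent first
// (ties broken alphabetically so output is deterministic).
function tally(values: string[], topN = 5): Array<{ value: string; count: number }> {
  const counts = new Map<string, number>();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
  return Array.from(counts.entries())
    .map(([value, count]) => ({ value, count }))
    .sort((a, b) => b.count - a.count || a.value.localeCompare(b.value))
    .slice(0, topN);
}

// Build the commonContext object in the shape used by ErrorClusterDoc.
function buildCommonContext(errors: EnrichedError[]) {
  return {
    osVersions: tally(errors.map((e) => e.osVersion)).map(({ value, count }) => ({ version: value, count })),
    appVersions: tally(errors.map((e) => e.appVersion)).map(({ value, count }) => ({ version: value, count })),
    deviceModels: tally(errors.map((e) => e.deviceModel)).map(({ value, count }) => ({ model: value, count })),
    screenContexts: tally(errors.map((e) => e.screen)).map(({ value, count }) => ({ screen: value, count })),
  };
}
```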

DiagnosticInsightDoc

interface DiagnosticInsightDoc {
  id: string; // di_<uuid>
  clusterId: string; // partition key (with productId)
  productId: string;

  // AI-generated analysis
  analysisType: 'root_cause' | 'pattern' | 'comparison' | 'trend';
  generatedAt: string;

  // LLM output
  rootCauseCategory: 'config' | 'dependency' | 'logic' | 'resource' | 'external' | 'unknown';
  hypothesis: string; // Natural language explanation
  reasoning: string; // Why LLM thinks this
  confidence: 'high' | 'medium' | 'low';
  confidenceScore: number; // 0.0-1.0

  // Evidence
  evidence: Array<{
    type:
      | 'stack_trace'
      | 'telemetry_pattern'
      | 'device_correlation'
      | 'api_failure'
      | 'similar_issue';
    description: string;
    strength: 'strong' | 'moderate' | 'weak';
    data: Record<string, unknown>;
  }>;

  // Suggested actions
  suggestedInvestigation: string[];
  potentialFixDirection?: string;
  similarResolvedIssues?: Array<{
    clusterId: string;
    resolution: string;
    confidence: number;
  }>;

  // Feedback
  feedbackStats: {
    helpful: number;
    notHelpful: number;
    engineerNotes: string[];
  };

  // LLM metadata
  modelUsed: string; // gpt-4o, gpt-4o-mini
  promptTokens: number;
  completionTokens: number;

  createdAt: string;
  ttl: number; // 90 days
}

NaturalLanguageQueryDoc

interface NaturalLanguageQueryDoc {
  id: string; // nq_<uuid>
  userId: string; // Admin who asked
  productId?: string; // Optional filter

  // Query
  rawQuery: string; // "Why did iOS keyboard crash yesterday?"
  parsedIntent: 'root_cause' | 'pattern_search' | 'comparison' | 'trend' | 'impact';
  extractedEntities: {
    products?: string[];
    timeRange?: { start: string; end: string };
    errorTypes?: string[];
    platforms?: string[];
    userSegments?: string[];
  };

  // Execution
  executedQuery: string; // Translated Cosmos query
  dataSources: string[]; // Clusters, telemetry, sessions accessed
  executionTimeMs: number;

  // Response
  aiResponse: string; // Generated answer
  confidence: number; // Overall confidence
  supportingData: Array<{
    type: 'cluster' | 'telemetry' | 'session';
    id: string;
    relevanceScore: number;
  }>;

  // Feedback
  userRating?: 'helpful' | 'not_helpful';
  userComment?: string;

  createdAt: string;
  ttl: number; // 30 days
}

Appendix B: API Reference

| Method | Endpoint | Auth | Description |
| --- | --- | --- | --- |
| POST | /ai-diagnostics/query | Admin | Natural language diagnostic query |
| GET | /ai-diagnostics/clusters | Admin | List error clusters (with AI summaries) |
| GET | /ai-diagnostics/clusters/:id | Admin | Cluster detail with AI analysis |
| POST | /ai-diagnostics/clusters/:id/analyze | Admin | Trigger fresh LLM analysis |
| GET | /ai-diagnostics/clusters/:id/analysis | Admin | Get pre-computed insight |
| GET | /ai-diagnostics/suggestions | Admin | AI-suggested investigations |
| POST | /ai-diagnostics/feedback | Admin | Rate insight helpfulness |
| POST | /ai-diagnostics/search | Admin | Semantic search across errors |

Appendix C: Integration Points

With Telemetry Module

  • Error events auto-create/update clusters
  • Telemetry context enriches error analysis
  • Correlation IDs link errors to user journeys

With Diagnostics Module

  • Debug sessions linked to error clusters
  • Screenshots from sessions aid visual analysis
  • Network traces provide API failure context

With Event Bus

| Event | Action |
| --- | --- |
| telemetry.error.ingested | Update/create cluster, trigger re-analysis if new pattern |
| diagnostics.session.completed | Link session to related clusters, analyze captured logs |
| diagnostics.ingest.fatal | High-priority cluster analysis, alert if novel pattern |
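
The event-to-action mapping above could be expressed as a small planning function like the one below. The event names come from the table; the action names and the `isNovelPattern` flag are hypothetical, and the real handlers would perform the work rather than return labels.

```typescript
type DiagnosticsEventName =
  | "telemetry.error.ingested"
  | "diagnostics.session.completed"
  | "diagnostics.ingest.fatal";

// Hypothetical action labels, one per step described in the table.
type PlannedAction =
  | "upsert_cluster"
  | "reanalyze_cluster"
  | "link_session"
  | "analyze_session_logs"
  | "priority_analysis"
  | "send_alert";

// Decide which actions an incoming event should trigger.
function planActions(eventName: DiagnosticsEventName, isNovelPattern: boolean): PlannedAction[] {
  switch (eventName) {
    case "telemetry.error.ingested":
      return isNovelPattern ? ["upsert_cluster", "reanalyze_cluster"] : ["upsert_cluster"];
    case "diagnostics.session.completed":
      return ["link_session", "analyze_session_logs"];
    case "diagnostics.ingest.fatal":
      return isNovelPattern ? ["priority_analysis", "send_alert"] : ["priority_analysis"];
  }
}
```

Keeping the decision separate from the side effects makes the mapping easy to unit test.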

Appendix D: Cost Estimation

| Component | Monthly Cost (est.) |
| --- | --- |
| Azure OpenAI embeddings | $50-100 (10K errors/day) |
| GPT-4o-mini analysis | $100-200 (1K analyses/day) |
| GPT-4o deep analysis | $50-100 (100 deep analyses/day) |
| Cosmos DB vector storage | $20-50 |
| Total | $220-450/month |

Optimization:

  • Cache frequent cluster analyses (24hr TTL)
  • Use GPT-4o-mini for 90% of queries
  • Batch embedding jobs during off-peak
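
The first two optimizations could be sketched as a small TTL cache plus a model-selection rule. This is an in-memory illustration only: a deployed service would more likely use a shared cache, and the rule that only explicit deep-analysis requests go to GPT-4o is an assumption about how the 90% target is reached.

```typescript
// Minimal in-memory cache whose entries expire after ttlMs.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, handy for tests
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazily evict expired entries
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

const ANALYSIS_TTL_MS = 24 * 60 * 60 * 1000; // 24hr TTL from the list above

// Route to the cheaper model unless a deeper analysis is explicitly requested.
function pickModel(deepAnalysisRequested: boolean): "gpt-4o" | "gpt-4o-mini" {
  return deepAnalysisRequested ? "gpt-4o" : "gpt-4o-mini";
}
```

A caller would check `cache.get(clusterId)` before invoking the LLM and store the fresh insight with `cache.set(clusterId, insight)` afterwards.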

Current Status

  • Design complete — 2026-03-03
  • Phase 1: Data Pipeline — Complete
  • Phase 2: LLM Engine — Complete
  • Phase 3: Query Interface — Complete
  • Phase 4: Admin UI — Complete
  • Phase 5: Advanced Capabilities — Future

Estimated Timeline: COMPLETE (Phases 1-4)

Dependencies:

  • Telemetry module (must be collecting errors)
  • Diagnostics module (optional, for rich context)
  • Azure OpenAI deployment (embedding + GPT-4o access)

Last Updated: 2026-03-03