saravanakumardb1 a31fdfe55a feat(predictive-analytics): complete admin UI for churn prediction and health scoring

- Add health-dashboard page with 6-dimension health cards and anomaly detection
- Add predictive/at-risk page with user risk profiles and segmentation
- Add predictive/campaigns page with campaign management and stats
- Add predictive-client.ts API client with full type coverage
- Update all 3 roadmaps to reflect complete implementation status

2026-03-03 13:48:37 -08:00

AI Diagnostic Assistant — Implementation Roadmap

Module: platform-service/src/modules/ai-diagnostics/
Admin UI: /ops/ai-diagnostics/
Target: LLM-powered root cause analysis from telemetry + debug sessions
Estimated Effort: 2–3 weeks
Status: ✅ Complete (Phases 1–4); Phase 5 is future work

Executive Summary

This roadmap delivers an AI-powered diagnostic assistant that analyzes error patterns, debug session data, and telemetry to automatically suggest root causes—like having a senior engineer on-call 24/7. Engineers can ask natural language questions like "Why did the iOS keyboard crash yesterday?" and receive AI-generated hypotheses with supporting evidence.

Key Differentiators vs. Manual Debugging

Feature	Manual Debugging	AI Diagnostic Assistant
Query	SQL + log grep	Natural language
Pattern Detection	Hours of manual correlation	AI finds hidden patterns
Context Assembly	Check 5+ systems manually	Auto-assembles timeline
Hypothesis	Engineer intuition	LLM-generated + evidence
Learning	Per-engineer experience	Accumulates across all sessions

Phase 1: Data Pipeline & Embeddings (Week 1)

Goal: Extract, normalize, and embed error data for semantic search and clustering.

1.1 Error Fingerprinting & Clustering

1.1.1 Create modules/ai-diagnostics/types.ts
- ErrorClusterDoc — grouped similar errors with signature
- ErrorFingerprint — normalized stack trace hash
- ClusterAnalysis — AI-generated pattern description
- Zod schemas for all inputs
Commit format: git commit -m "feat(ai-diagnostics): add error clustering types [1.1.1]" → https://github.com/saravanakumardb1/learning_ai_common_plat/commit/<hash>
1.1.2 Add Cosmos containers to cosmos-init.ts
- error_clusters (pk: /productId, TTL: 90 days)
- error_fingerprints (pk: /fingerprintHash, unique index)
- diagnostic_insights (pk: /clusterId, AI-generated analyses)
Commit format: git commit -m "feat(ai-diagnostics): add cosmos containers for error clustering [1.1.2]"
1.1.3 Implement error normalization
- Stack trace parsing (remove line numbers, file paths)
- Message templating (replace UUIDs, timestamps, user IDs with placeholders)
- Fingerprint generation (SHA-256 of normalized error)
- Similarity scoring (Levenshtein for near-matches)
Commit format: git commit -m "feat(ai-diagnostics): implement error normalization and fingerprinting [1.1.3]"
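
The normalization steps in 1.1.3 can be sketched as follows. This is a minimal illustration, not the module's actual code: the placeholder patterns, the V8-style `at fn (file:line:col)` frame format, and the function names (`templateMessage`, `normalizeStack`, `fingerprint`, `similarity`) are all assumptions.

```typescript
import { createHash } from "node:crypto";

// Illustrative placeholder rules (assumptions); UUIDs are replaced first.
const PLACEHOLDERS: Array<[RegExp, string]> = [
  [/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, "<uuid>"],
  [/\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?/g, "<timestamp>"],
  [/\buser[_-]?\d+\b/gi, "<userId>"],
];

// Message templating: swap volatile values for stable placeholders.
export function templateMessage(message: string): string {
  return PLACEHOLDERS.reduce((msg, [re, token]) => msg.replace(re, token), message);
}

// Stack parsing: keep only function names, dropping paths and line numbers.
// Frames without a function name collapse to "at <anonymous>".
export function normalizeStack(stack: string): string {
  return stack
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.startsWith("at "))
    .map((line) => {
      const named = line.match(/^at (.+?) \(.*\)$/);
      return named ? `at ${named[1]}` : "at <anonymous>";
    })
    .join("\n");
}

// Fingerprint: SHA-256 hex digest of the normalized error.
export function fingerprint(errorType: string, message: string, stack: string): string {
  const normalized = [errorType, templateMessage(message), normalizeStack(stack)].join("\n");
  return createHash("sha256").update(normalized).digest("hex");
}

// Levenshtein edit distance using two rolling rows.
export function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]; 1 means identical strings.
export function similarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}
```

Two errors that differ only in file paths, line numbers, or embedded IDs hash to the same fingerprint, while `similarity` gives a score for near-matches that hash differently.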

1.2 Vector Embeddings for Semantic Search

1.2.1 Create embedding pipeline
- Azure OpenAI text-embedding-3-small integration
- Error message + stack trace → 1536-dim vector
- Batch embedding job (100 errors at a time)
1.2.2 Cosmos DB vector search setup
- Store embeddings in error_clusters documents
- Cosine similarity query function
- Similar error lookup by vector distance
1.2.3 Clustering algorithm
- HDBSCAN for density-based clustering
- DBSCAN fallback for smaller datasets
- Auto-determine cluster count (no manual k)
- Re-cluster nightly as new errors arrive
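
As a rough illustration of 1.2.1 and 1.2.2, the sketch below batches inputs for embedding and ranks candidates by cosine similarity in memory. It is an assumption-laden stand-in: in the roadmap the similarity query runs against Cosmos DB vector search, and the names `chunk`, `cosineSimilarity`, and `findSimilar`, as well as the `topK` and `minScore` defaults, are hypothetical.

```typescript
// Split items into batches (the roadmap embeds 100 errors at a time).
export function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new RangeError("size must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Cosine similarity of two equal-length vectors; 0 if either is all zeros.
export function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new RangeError("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

interface Embedded {
  id: string;
  embedding: number[];
}

// Return the topK candidates at or above minScore, most similar first.
export function findSimilar(
  query: number[],
  candidates: Embedded[],
  topK = 5,
  minScore = 0.8,
): Array<{ id: string; score: number }> {
  return candidates
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.embedding) }))
    .filter((r) => r.score >= minScore)
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

An in-memory scan like this is only practical for small sets or tests; at the 10,000+ error scale in the exit criteria, the lookup belongs in the database's vector index.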

1.3 Telemetry Ingestion for Context

1.3.1 Link telemetry to errors
- correlationId propagation across services
- 5-minute window: error → preceding telemetry events
- Session state reconstruction (what user was doing)
1.3.2 Enrich error context
- Device info (OS version, model, memory)
- App state (screen, feature flags, config)
- Recent API calls (network trace from diagnostics)
- Recent user actions (breadcrumb trail)
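
The 5-minute correlation window from 1.3.1 could look roughly like this. The event shapes (`TelemetryEvent`, `ErrorOccurrence`) and the function name `precedingEvents` are assumptions made for illustration; only `correlationId` and the 5-minute window come from the roadmap.

```typescript
interface TelemetryEvent {
  correlationId: string;
  timestamp: string; // ISO 8601
  name: string;
}

interface ErrorOccurrence {
  correlationId: string;
  timestamp: string; // ISO 8601
}

const WINDOW_MS = 5 * 60 * 1000;

// Return telemetry events that share the error's correlationId and occurred
// at most windowMs before the error, oldest first.
export function precedingEvents(
  error: ErrorOccurrence,
  events: TelemetryEvent[],
  windowMs: number = WINDOW_MS,
): TelemetryEvent[] {
  const errorTime = Date.parse(error.timestamp);
  return events
    .filter((e) => e.correlationId === error.correlationId)
    .filter((e) => {
      const t = Date.parse(e.timestamp);
      return t <= errorTime && errorTime - t <= windowMs;
    })
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
}
```

The sorted result doubles as a breadcrumb trail for session state reconstruction.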

Phase 1 Exit Criteria:

Errors auto-clustered with 90%+ accuracy
Vector search returns semantically similar errors
10,000+ historical errors embedded and clustered
Correlation pipeline links errors to telemetry context

Phase 2: LLM Analysis Engine (Week 1–2)

2.1 Prompt Engineering & Analysis Pipeline

2.1.1 Create analysis prompts

ROOT_CAUSE_ANALYSIS prompt template

Given this error cluster:
- Error signature: {fingerprint}
- Sample stack traces: {samples}
- Common context: {deviceStats}, {appState}
- Preceding events: {breadcrumbSummary}
- Similar resolved issues: {relatedClusters}

Analyze and provide:
1. Likely root cause category (config, dependency, logic, resource, external)
2. Specific hypothesis with reasoning
3. Evidence confidence (high/medium/low)
4. Suggested investigation steps
5. Potential fix direction

PATTERN_SUMMARY prompt for cluster descriptions
COMPARATIVE_ANALYSIS for error vs. baseline
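
One simple way to fill the `{placeholder}` slots in a template like ROOT_CAUSE_ANALYSIS is shown below. The helper name `renderPrompt` and the choice to fail on a missing variable are assumptions; the placeholder names match the template above.

```typescript
// Abbreviated copy of the template, for illustration only.
export const ROOT_CAUSE_ANALYSIS = [
  "Given this error cluster:",
  "- Error signature: {fingerprint}",
  "- Sample stack traces: {samples}",
  "- Common context: {deviceStats}, {appState}",
  "- Preceding events: {breadcrumbSummary}",
  "- Similar resolved issues: {relatedClusters}",
].join("\n");

// Replace every {name} with values[name]; throw if a variable is missing so
// an incomplete prompt is never sent to the model.
export function renderPrompt(template: string, values: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (_match: string, key: string) => {
    if (!Object.prototype.hasOwnProperty.call(values, key)) {
      throw new Error(`Missing prompt variable: ${key}`);
    }
    return values[key];
  });
}
```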

2.1.2 LLM integration
- Azure OpenAI GPT-4o-mini for analysis (cost-effective)
- GPT-4o for complex multi-factor analysis
- Response JSON schema enforcement
- Retry logic with exponential backoff
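
The last two items in 2.1.2 can be sketched without any SDK. This is a hedged illustration: `backoffDelayMs`, `withRetry`, and `parseAnalysis` are hypothetical names, the retry defaults are arbitrary, and the hand-written validation stands in for whatever schema enforcement the service actually uses (the roadmap mentions Zod schemas for inputs).

```typescript
// Delay before retry number `attempt` (0-based), doubling each time.
export function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** attempt;
}

// Run fn, retrying on any thrown error with exponential backoff plus jitter.
export async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts - 1) break;
      const delay = backoffDelayMs(attempt, baseDelayMs) + Math.random() * baseDelayMs;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}

const CATEGORIES = ["config", "dependency", "logic", "resource", "external", "unknown"] as const;
const CONFIDENCES = ["high", "medium", "low"] as const;

export interface Analysis {
  rootCauseCategory: (typeof CATEGORIES)[number];
  hypothesis: string;
  confidence: (typeof CONFIDENCES)[number];
}

// Parse the model's JSON reply and reject anything off-schema.
export function parseAnalysis(raw: string): Analysis {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) throw new Error("response is not an object");
  const o = data as Record<string, unknown>;
  const category = CATEGORIES.find((c) => c === o.rootCauseCategory);
  const confidence = CONFIDENCES.find((c) => c === o.confidence);
  if (!category) throw new Error("invalid rootCauseCategory");
  if (!confidence) throw new Error("invalid confidence");
  if (typeof o.hypothesis !== "string" || o.hypothesis.length === 0) {
    throw new Error("invalid hypothesis");
  }
  return { rootCauseCategory: category, hypothesis: o.hypothesis, confidence };
}
```

A caller would wrap the model request as `withRetry(() => callModel(prompt))`, where `callModel` is a placeholder for the real client call, and pass the returned text to `parseAnalysis`.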

2.2 Insight Generation Service

2.2.1 Create modules/ai-diagnostics/analyzer.ts
- analyzeCluster(clusterId) — full analysis workflow
- generateInsight(errorContext) — single error analysis
- compareClusters(clusterA, clusterB) — diff analysis
2.2.2 Analysis workflow
- Fetch cluster data + related telemetry
- Build LLM context (respect token limits)
- Call LLM with structured prompt
- Parse and validate response
- Store insight in diagnostic_insights
2.2.3 Confidence scoring
- Evidence count weighting
- Similar resolved issue bonus
- Recency decay (older patterns = lower confidence)
- Multi-model consensus (if available)
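
One possible way to combine the first three factors in 2.2.3 into a single 0.0–1.0 score is sketched below. Every constant here (the saturation scale of 5, the 0.05 bonus capped at three issues, the 30-day half-life, and the label thresholds) is an assumption chosen for illustration, not a value from the roadmap.

```typescript
interface ConfidenceInput {
  evidenceCount: number; // pieces of supporting evidence
  similarResolvedCount: number; // similar clusters already resolved
  ageDays: number; // age of the underlying pattern
}

export function confidenceScore(input: ConfidenceInput): number {
  // Evidence weighting: saturates so the 20th item adds less than the 2nd.
  const evidence = 1 - Math.exp(-input.evidenceCount / 5);
  // Similar resolved issue bonus, capped.
  const bonus = Math.min(input.similarResolvedCount, 3) * 0.05;
  // Recency decay: halve the score every 30 days.
  const decay = Math.pow(0.5, input.ageDays / 30);
  return Math.min(1, Math.max(0, (evidence + bonus) * decay));
}

// Map the numeric score to the high/medium/low label stored on the insight.
export function toLabel(score: number): "high" | "medium" | "low" {
  if (score >= 0.75) return "high";
  if (score >= 0.4) return "medium";
  return "low";
}
```

Feedback-driven recalibration (2.3.1) would then amount to tuning these constants against observed resolution outcomes.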

2.3 Continuous Learning

2.3.1 Feedback loop
- Engineer feedback: "Was this insight helpful? 👍/👎"
- Resolution tracking (link commits to clusters)
- Confidence recalibration based on outcomes
2.3.2 Pattern accumulation
- "Known issues" database (manually curated)
- Historical fix patterns (what solved similar issues)
- Regression detection (old issue reappearing)
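
Regression detection and feedback aggregation can be expressed as two small pure functions. This is a sketch under assumptions: the `status` and `resolvedAt` fields mirror ErrorClusterDoc, while the function names and the add-one smoothing in `helpfulRate` are illustrative choices.

```typescript
interface ClusterState {
  status: "active" | "investigating" | "resolved" | "ignored";
  resolvedAt?: string; // ISO 8601
}

// A regression is a new occurrence of a cluster that was already resolved.
export function isRegression(cluster: ClusterState, occurredAt: string): boolean {
  if (cluster.status !== "resolved" || !cluster.resolvedAt) return false;
  return Date.parse(occurredAt) > Date.parse(cluster.resolvedAt);
}

// Smoothed share of 👍 ratings; starts at 0.5 when there is no feedback so
// one early vote does not swing the estimate to 0 or 1.
export function helpfulRate(helpful: number, notHelpful: number): number {
  return (helpful + 1) / (helpful + notHelpful + 2);
}
```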

Phase 2 Exit Criteria:

LLM generates root cause hypotheses with evidence
Confidence scores align with actual resolution rates
Analysis completes in < 5 seconds for typical clusters
Feedback loop capturing engineer ratings

Phase 3: Natural Language Query Interface (Week 2)

3.1 Query Understanding

3.1.1 Create modules/ai-diagnostics/query-parser.ts
- Intent classification (root cause, pattern search, comparison, trend)
- Entity extraction (product, time range, error type, user segment)
- Temporal parsing ("yesterday", "last week", "since v2.1")
- Constraint identification ("only iOS", "excluding beta users")
3.1.2 Query patterns
- Root cause: "Why did X happen?" → analyze cluster
- Pattern search: "Show me similar crashes" → vector search
- Comparison: "Did error rate increase after release?" → trend analysis
- User impact: "How many users affected by Y?" → aggregation query
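
A rule-based first pass at intent classification and temporal parsing might look like the sketch below. It is only an approximation of 3.1.1: the keyword rules, the `pattern_search` default, the UTC day boundaries, and the reading of "last week" as the trailing seven days are all assumptions, and the real parser may rely on an LLM instead.

```typescript
export type Intent = "root_cause" | "pattern_search" | "comparison" | "trend" | "impact";

// Keyword rules, checked in order; the first match wins.
const INTENT_RULES: Array<[RegExp, Intent]> = [
  [/^\s*why\b|root cause/i, "root_cause"],
  [/\bsimilar\b/i, "pattern_search"],
  [/\bhow many\b|\baffected\b/i, "impact"],
  [/\bincrease|\bdecrease|\bcompare|\bafter (the )?release\b/i, "comparison"],
  [/\btrend|\bover time\b/i, "trend"],
];

export function classifyIntent(query: string): Intent {
  const rule = INTENT_RULES.find(([re]) => re.test(query));
  return rule ? rule[1] : "pattern_search";
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Resolve relative phrases against `now`; returns undefined if none is found.
export function parseTimeRange(
  query: string,
  now: Date,
): { start: string; end: string } | undefined {
  const todayStart = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
  if (/\byesterday\b/i.test(query)) {
    return {
      start: new Date(todayStart - DAY_MS).toISOString(),
      end: new Date(todayStart).toISOString(),
    };
  }
  if (/\blast week\b/i.test(query)) {
    return {
      start: new Date(now.getTime() - 7 * DAY_MS).toISOString(),
      end: now.toISOString(),
    };
  }
  return undefined;
}
```

Version-relative phrases such as "since v2.1" need release metadata and are left out of this sketch.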

3.2 Query Execution Engine

3.2.1 Query → data pipeline
- Map entities to Cosmos queries
- Fetch relevant clusters, telemetry, sessions
- Assemble context for response generation
3.2.2 Response generation
- Direct answers for simple queries
- AI-generated summaries for complex analysis
- Data + visualization suggestions
- Drill-down links for exploration

3.3 REST API Routes

3.3.1 Create modules/ai-diagnostics/routes.ts
- POST /ai-diagnostics/query — natural language question
- GET /ai-diagnostics/clusters/:id/analysis — pre-computed insight
- POST /ai-diagnostics/clusters/:id/analyze — trigger fresh analysis
- GET /ai-diagnostics/suggestions — auto-suggested investigations
- POST /ai-diagnostics/feedback — submit insight rating
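
Before a natural language question reaches the parser, the `POST /ai-diagnostics/query` handler needs to validate its body. The sketch below is dependency-free and hypothetical: the body fields `query` and `productId` and the 500-character limit are assumptions, and the actual routes may use Zod for this.

```typescript
export interface QueryRequest {
  query: string;
  productId?: string;
}

export type ValidationResult =
  | { ok: true; value: QueryRequest }
  | { ok: false; error: string };

const MAX_QUERY_LENGTH = 500; // assumed limit

export function validateQueryRequest(body: unknown): ValidationResult {
  if (typeof body !== "object" || body === null) {
    return { ok: false, error: "body must be a JSON object" };
  }
  const { query, productId } = body as Record<string, unknown>;
  if (typeof query !== "string" || query.trim().length === 0) {
    return { ok: false, error: "query must be a non-empty string" };
  }
  if (query.length > MAX_QUERY_LENGTH) {
    return { ok: false, error: `query must be at most ${MAX_QUERY_LENGTH} characters` };
  }
  if (productId !== undefined && typeof productId !== "string") {
    return { ok: false, error: "productId must be a string" };
  }
  const value: QueryRequest = { query: query.trim() };
  if (productId !== undefined) value.productId = productId;
  return { ok: true, value };
}
```

A route handler would respond with HTTP 400 and the `error` text when `ok` is false.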

Phase 3 Exit Criteria:

Natural language queries parse correctly (90%+ intent accuracy)
Query → response pipeline < 3 seconds
Complex queries return structured answers with evidence
API routes tested and documented

Phase 4: Admin Dashboard UI (Week 2–3)

4.1 AI Insights Page

4.1.1 Create /ops/ai-diagnostics/page.tsx
- Smart search bar (natural language input)
- Suggested queries based on recent errors
- Recent AI-generated insights list
- Trending clusters (auto-detected anomalies)
4.1.2 Query results view
- AI-generated answer with confidence badge
- Supporting evidence cards (cluster stats, sample errors)
- Related debug sessions (linked traces)
- Timeline visualization of error pattern
- "Investigate further" actions

4.2 Cluster Detail with AI Analysis

4.2.1 Enhance error cluster detail
- AI-generated summary card ("This appears to be...")
- Root cause hypothesis with confidence
- Evidence breakdown (stack samples, device patterns, API failures)
- Suggested fixes from similar resolved issues
- "Request deeper analysis" button (GPT-4o)
4.2.2 Interactive investigation
- Compare with other clusters ("Show me similar issues")
- Filter by context (OS version, app version, feature flags)
- View affected user journeys (breadcrumb trails)

4.3 Proactive Alerts

4.3.1 Anomaly detection
- Auto-detect emerging error clusters
- Spike in existing cluster frequency
- New error types after releases
4.3.2 AI-generated alerts
- Slack/Teams notification with summary
- "Investigate in AI Diagnostics" deep link
- Auto-started debug session recommendations
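
A frequency spike in an existing cluster (4.3.1) can be flagged with a z-score check over recent counts. This is one simple technique among several and not necessarily what the alerting uses; the function name `isSpike` and the threshold of 3 standard deviations are assumptions.

```typescript
// history: occurrence counts for past intervals (for example, hourly).
// current: the count for the interval being evaluated.
export function isSpike(history: number[], current: number, zThreshold = 3): boolean {
  if (history.length < 2) return false; // not enough data for a baseline
  const mean = history.reduce((sum, x) => sum + x, 0) / history.length;
  const variance =
    history.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (history.length - 1);
  const std = Math.sqrt(variance);
  // With a perfectly flat baseline, treat any increase as a spike.
  if (std === 0) return current > mean;
  return (current - mean) / std >= zThreshold;
}
```

For example, with hourly counts of `[10, 12, 11, 9, 10, 11]`, a current count of 40 is flagged and a count of 12 is not.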

Phase 4 Exit Criteria:

Admin can ask questions and get AI-generated answers
Cluster detail shows AI analysis with evidence
Proactive alerts for emerging issues
Full test coverage (UI + API)

Phase 5: Advanced Capabilities (Future)

Analyze screenshots from debug sessions for UI issues
Voice transcription analysis (for voice app errors)
Performance trace visualization with AI annotations

5.2 Predictive Diagnostics

Pre-crash pattern detection (warn before crash happens)
Resource exhaustion prediction (memory, disk, API quotas)
Config drift detection ("this setting combination often fails")

5.3 Self-Healing Suggestions

Auto-generated config recommendations
Feature flag rollback suggestions
Circuit breaker threshold recommendations

Implementation Tracking

Phase	Task	Status	Commit
1.1	Error clustering types	✅	`4de0126`
1.1	Cosmos containers	✅	`4de0126`
1.1	Error normalization	✅	`8cdddd7`
1.2	Embedding pipeline	✅	`50b7e22`
1.2	Vector search setup	✅	`6b97476`
1.2	Clustering algorithm	✅	`8951ab2`
1.3	Telemetry linking	✅	`1ff0293`
1.3	Error context enrichment	✅	`1ff0293`
2.1	Analysis prompts	✅	`97b3ffb`
2.1	LLM integration	✅	`97b3ffb`
2.2	Insight generation service	✅	`97b3ffb`
2.2	Analysis workflow	✅	`97b3ffb`
2.2	Confidence scoring	✅	`97b3ffb`
2.3	Feedback loop	✅	`97b3ffb`
2.3	Pattern accumulation	✅	`97b3ffb`
3.1	Query parser	✅	`71cbb57`
3.1	Query patterns	✅	`71cbb57`
3.2	Query execution	✅	`71cbb57`
3.2	Response generation	✅	`71cbb57`
3.3	REST API routes	✅	`1cba699`
4.1	AI insights page	✅	`460ed6d`
4.1	Query results view	✅	`460ed6d`
4.2	Cluster detail	✅	`460ed6d`
4.2	Interactive investigation	✅	`eab8543`
4.3	Proactive alerts	✅	`eab8543`

Legend: ⬜ Not started | 🟡 In progress | ✅ Complete | ⏸️ Deferred

Quick Reference for Implementing Agent

📋 Full Roadmap: /Users/sd9235/code/mygh/learning_ai_common_plat/docs/roadmaps/AI_DIAGNOSTIC_ASSISTANT_ROADMAP.md

Key Files to Modify/Create:

services/platform-service/
├── src/
│   ├── modules/ai-diagnostics/
│   │   ├── types.ts              # [1.1.1] Error clustering types
│   │   ├── repository.ts         # [1.2] Data access layer
│   │   ├── analyzer.ts           # [2.2] LLM analysis engine
│   │   ├── query-parser.ts       # [3.1] NL query understanding
│   │   ├── query-executor.ts     # [3.2] Query execution
│   │   ├── routes.ts             # [3.3] REST API
│   │   └── ai-diagnostics.test.ts # Tests
│   ├── lib/
│   │   ├── cosmos-init.ts        # [1.1.2] Add containers
│   │   ├── embedding-client.ts   # [1.2.1] Azure OpenAI embeddings
│   │   └── pii-redaction.ts      # Reuse existing
│   └── server.ts                 # [3.3] Register routes
dashboards/admin-web/
├── src/
│   ├── app/(dashboard)/
│   │   ├── ai-diagnostics/
│   │   │   ├── page.tsx          # [4.1] Main insights page
│   │   │   └── [id]/
│   │   │       └── page.tsx      # [4.2] Cluster detail
│   ├── lib/
│   │   └── ai-diagnostics-client.ts # API client
│   └── components/
│       └── ai-diagnostics/       # Reusable components

Commit Message Format:

feat(ai-diagnostics): <description> [<task.code>]

Example:

git add services/platform-service/src/modules/ai-diagnostics/
git commit -m "feat(ai-diagnostics): add error clustering types and cosmos containers [1.1.1-1.1.2]"

Testing Requirements:

Unit tests: 20+ Vitest tests for clustering, embeddings, LLM responses
Integration tests: End-to-end query → analysis pipeline

Dependencies:

Telemetry module (error events)
Azure OpenAI (embeddings + GPT-4o)
Existing diagnostics module (optional linking)

Appendix A: Data Models

ErrorClusterDoc

interface ErrorClusterDoc {
  id: string; // ec_<uuid>
  productId: string; // partition key
  fingerprintHash: string; // SHA-256 of normalized error

  // Cluster metadata
  firstSeenAt: string; // ISO 8601
  lastSeenAt: string;
  occurrenceCount: number; // Total occurrences
  uniqueUsers: number; // Affected user count

  // Error signature
  errorType: string; // Exception class/name
  messageTemplate: string; // Normalized message with placeholders
  stackSignature: string; // Normalized stack frames

  // Vector embedding for semantic search
  embedding: number[]; // 1536-dim from text-embedding-3-small
  embeddingVersion: string; // Model version for re-embedding

  // Context patterns (auto-extracted)
  commonContext: {
    osVersions: Array<{ version: string; count: number }>;
    appVersions: Array<{ version: string; count: number }>;
    deviceModels: Array<{ model: string; count: number }>;
    screenContexts: Array<{ screen: string; count: number }>;
  };

  // Related data
  relatedClusterIds: string[]; // Similar clusters (vector similarity)
  mergedIntoClusterId?: string; // If deduplicated

  // Resolution tracking
  status: 'active' | 'investigating' | 'resolved' | 'ignored';
  resolvedAt?: string;
  resolutionCommit?: string; // Link to fix

  // Timestamps
  createdAt: string;
  updatedAt: string;
  ttl: number; // seconds (Cosmos DB TTL); 90 days = 7,776,000
}

DiagnosticInsightDoc

interface DiagnosticInsightDoc {
  id: string; // di_<uuid>
  clusterId: string; // partition key (with productId)
  productId: string;

  // AI-generated analysis
  analysisType: 'root_cause' | 'pattern' | 'comparison' | 'trend';
  generatedAt: string;

  // LLM output
  rootCauseCategory: 'config' | 'dependency' | 'logic' | 'resource' | 'external' | 'unknown';
  hypothesis: string; // Natural language explanation
  reasoning: string; // Why LLM thinks this
  confidence: 'high' | 'medium' | 'low';
  confidenceScore: number; // 0.0–1.0

  // Evidence
  evidence: Array<{
    type:
      | 'stack_trace'
      | 'telemetry_pattern'
      | 'device_correlation'
      | 'api_failure'
      | 'similar_issue';
    description: string;
    strength: 'strong' | 'moderate' | 'weak';
    data: Record<string, unknown>;
  }>;

  // Suggested actions
  suggestedInvestigation: string[];
  potentialFixDirection?: string;
  similarResolvedIssues?: Array<{
    clusterId: string;
    resolution: string;
    confidence: number;
  }>;

  // Feedback
  feedbackStats: {
    helpful: number;
    notHelpful: number;
    engineerNotes: string[];
  };

  // LLM metadata
  modelUsed: string; // gpt-4o, gpt-4o-mini
  promptTokens: number;
  completionTokens: number;

  createdAt: string;
  ttl: number; // seconds (Cosmos DB TTL); 90 days = 7,776,000
}

NaturalLanguageQueryDoc

interface NaturalLanguageQueryDoc {
  id: string; // nq_<uuid>
  userId: string; // Admin who asked
  productId?: string; // Optional filter

  // Query
  rawQuery: string; // "Why did iOS keyboard crash yesterday?"
  parsedIntent: 'root_cause' | 'pattern_search' | 'comparison' | 'trend' | 'impact';
  extractedEntities: {
    products?: string[];
    timeRange?: { start: string; end: string };
    errorTypes?: string[];
    platforms?: string[];
    userSegments?: string[];
  };

  // Execution
  executedQuery: string; // Translated Cosmos query
  dataSources: string[]; // Clusters, telemetry, sessions accessed
  executionTimeMs: number;

  // Response
  aiResponse: string; // Generated answer
  confidence: number; // Overall confidence
  supportingData: Array<{
    type: 'cluster' | 'telemetry' | 'session';
    id: string;
    relevanceScore: number;
  }>;

  // Feedback
  userRating?: 'helpful' | 'not_helpful';
  userComment?: string;

  createdAt: string;
  ttl: number; // seconds (Cosmos DB TTL); 30 days = 2,592,000
}

Appendix B: API Reference

Method	Endpoint	Auth	Description
POST	`/ai-diagnostics/query`	Admin	Natural language diagnostic query
GET	`/ai-diagnostics/clusters`	Admin	List error clusters (with AI summaries)
GET	`/ai-diagnostics/clusters/:id`	Admin	Cluster detail with AI analysis
POST	`/ai-diagnostics/clusters/:id/analyze`	Admin	Trigger fresh LLM analysis
GET	`/ai-diagnostics/clusters/:id/analysis`	Admin	Get pre-computed insight
GET	`/ai-diagnostics/suggestions`	Admin	AI-suggested investigations
POST	`/ai-diagnostics/feedback`	Admin	Rate insight helpfulness
POST	`/ai-diagnostics/search`	Admin	Semantic search across errors

Appendix C: Integration Points

With Telemetry Module

Error events auto-create/update clusters
Telemetry context enriches error analysis
Correlation IDs link errors to user journeys

With Diagnostics Module

Debug sessions linked to error clusters
Screenshots from sessions aid visual analysis
Network traces provide API failure context

With Event Bus

Event	Action
`telemetry.error.ingested`	Update/create cluster, trigger re-analysis if new pattern
`diagnostics.session.completed`	Link session to related clusters, analyze captured logs
`diagnostics.ingest.fatal`	High-priority cluster analysis, alert if novel pattern
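
The event-to-action mapping above can be sketched as a typed lookup table. The event names come from the table; the action identifiers, the `isNovelPattern` payload flag, and the `planActions` function are hypothetical and only illustrate the dispatch shape.

```typescript
type DiagnosticsEvent =
  | "telemetry.error.ingested"
  | "diagnostics.session.completed"
  | "diagnostics.ingest.fatal";

interface EventPayload {
  isNovelPattern?: boolean; // assumed flag set by the clustering step
}

type Planner = (payload: EventPayload) => string[];

// One planner per event; each returns the actions to enqueue, in order.
const PLANNERS: Record<DiagnosticsEvent, Planner> = {
  "telemetry.error.ingested": (p) =>
    p.isNovelPattern ? ["upsert_cluster", "analyze_cluster"] : ["upsert_cluster"],
  "diagnostics.session.completed": () => ["link_session_to_clusters", "analyze_session_logs"],
  "diagnostics.ingest.fatal": (p) =>
    p.isNovelPattern ? ["analyze_cluster_priority", "send_alert"] : ["analyze_cluster_priority"],
};

export function planActions(event: DiagnosticsEvent, payload: EventPayload = {}): string[] {
  return PLANNERS[event](payload);
}
```

Because `PLANNERS` is typed as a `Record` over the event union, adding a new event name without a planner is a compile-time error.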

Appendix D: Cost Estimation

Component	Monthly Cost (est.)
Azure OpenAI embeddings	$50–100 (10K errors/day)
GPT-4o-mini analysis	$100–200 (1K analyses/day)
GPT-4o deep analysis	$50–100 (100 deep analyses/day)
Cosmos DB vector storage	$20–50
Total	$220–450/month

Optimization:

Cache frequent cluster analyses (24hr TTL)
Use GPT-4o-mini for 90% of queries
Batch embedding jobs during off-peak
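
The 24-hour analysis cache could be as small as the sketch below. It is an in-memory illustration only: the class name `TtlCache` and the injectable clock are assumptions, and a multi-instance service would need a shared store instead.

```typescript
export class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Returns the cached value, or undefined if it is absent or expired.
  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
}

export const ANALYSIS_TTL_MS = 24 * 60 * 60 * 1000;
```

Keying the cache by cluster ID lets repeated views of the same cluster reuse one LLM call per day.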

Current Status

Design complete — 2026-03-03
Phase 1: Data Pipeline — Complete
Phase 2: LLM Engine — Complete
Phase 3: Query Interface — Complete
Phase 4: Admin UI — Complete
Phase 5: Advanced Capabilities — Future

Estimated Timeline: COMPLETE (Phases 1–4)