AI Diagnostic Assistant — Implementation Roadmap
Module: platform-service/src/modules/ai-diagnostics/
Admin UI: /ops/ai-diagnostics/
Target: LLM-powered root cause analysis from telemetry + debug sessions
Estimated Effort: 2–3 weeks
Status: ✅ Complete (Phases 1–4)
Executive Summary
This roadmap delivers an AI-powered diagnostic assistant that analyzes error patterns, debug session data, and telemetry to automatically suggest root causes—like having a senior engineer on-call 24/7. Engineers can ask natural language questions like "Why did the iOS keyboard crash yesterday?" and receive AI-generated hypotheses with supporting evidence.
Key Differentiators vs. Manual Debugging
| Feature | Manual Debugging | AI Diagnostic Assistant |
|---|---|---|
| Query | SQL + log grep | Natural language |
| Pattern Detection | Hours of manual correlation | AI finds hidden patterns |
| Context Assembly | Check 5+ systems manually | Auto-assembles timeline |
| Hypothesis | Engineer intuition | LLM-generated + evidence |
| Learning | Per-engineer experience | Accumulates across all sessions |
Phase 1: Data Pipeline & Embeddings (Week 1)
Goal: Extract, normalize, and embed error data for semantic search and clustering.
1.1 Error Fingerprinting & Clustering
- 1.1.1 Create modules/ai-diagnostics/types.ts
- ErrorClusterDoc: grouped similar errors with signature
- ErrorFingerprint: normalized stack trace hash
- ClusterAnalysis: AI-generated pattern description
- Zod schemas for all inputs
Commit format:
git commit -m "feat(ai-diagnostics): add error clustering types [1.1.1]" → https://github.com/saravanakumardb1/learning_ai_common_plat/commit/<hash>
- 1.1.2 Add Cosmos containers to cosmos-init.ts
- error_clusters (pk: /productId, TTL: 90 days)
- error_fingerprints (pk: /fingerprintHash, unique index)
- diagnostic_insights (pk: /clusterId, AI-generated analyses)
Commit format:
git commit -m "feat(ai-diagnostics): add cosmos containers for error clustering [1.1.2]"
- 1.1.3 Implement error normalization
- Stack trace parsing (remove line numbers, file paths)
- Message templating (replace UUIDs, timestamps, user IDs with placeholders)
- Fingerprint generation (SHA-256 of normalized error)
- Similarity scoring (Levenshtein for near-matches)
Commit format:
git commit -m "feat(ai-diagnostics): implement error normalization and fingerprinting [1.1.3]"
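The normalization steps in 1.1.3 can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the function names, regexes, and placeholder tokens (`<uuid>`, `<timestamp>`, `<userId>`) are assumptions, and the `user_123` ID format is hypothetical.

```typescript
import { createHash } from "node:crypto";

// Replace volatile tokens with placeholders so equivalent errors share one template.
// The patterns below are illustrative assumptions, not an exhaustive set.
function templateMessage(message: string): string {
  return message
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, "<uuid>")
    .replace(/\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?/g, "<timestamp>")
    .replace(/\buser[_-]?\d+\b/gi, "<userId>");
}

// Keep only "at ..." frames, then drop absolute directories and :line:column suffixes.
function normalizeStack(stack: string): string {
  return stack
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.startsWith("at "))
    .map((line) =>
      line
        .replace(/(?:[A-Za-z]:)?[\\\/][^\s()]*[\\\/]/g, "") // absolute directory prefix
        .replace(/:\d+(?::\d+)?/g, ""), // line and column numbers
    )
    .join("\n");
}

// SHA-256 over the normalized error, hex-encoded (64 characters).
function fingerprint(errorType: string, message: string, stack: string): string {
  const normalized = [errorType, templateMessage(message), normalizeStack(stack)].join("\n");
  return createHash("sha256").update(normalized, "utf8").digest("hex");
}

// Classic dynamic-programming edit distance, keeping only two rows in memory.
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]; 1 means identical strings.
function similarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}
```

Two errors that differ only in IDs, directories, or line numbers then hash to the same fingerprint, while `similarity` can flag near-matches whose templates differ slightly.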
1.2 Vector Embeddings for Semantic Search
- 1.2.1 Create embedding pipeline
- Azure OpenAI text-embedding-3-small integration
- Error message + stack trace → 1536-dim vector
- Batch embedding job (100 errors at a time)
- 1.2.2 Cosmos DB vector search setup
- Store embeddings in error_clusters documents
- Cosine similarity query function
- Similar error lookup by vector distance
- 1.2.3 Clustering algorithm
- HDBSCAN for density-based clustering
- DBSCAN fallback for smaller datasets
- Auto-determine cluster count (no manual k)
- Re-cluster nightly as new errors arrive
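To make the batching and similarity steps in 1.2 concrete, here is a small in-memory sketch. It assumes nothing about the real embedding client or the Cosmos DB query: `chunk`, `cosineSimilarity`, `nearestClusters`, and the `EmbeddedCluster` shape are hypothetical names used only for illustration. In the deployed system the distance computation would presumably run inside the database's vector index, and HDBSCAN clustering is out of scope for this sketch.

```typescript
// Split items into batches, mirroring the "100 errors at a time" embedding job.
function chunk<T>(items: T[], size = 100): T[][] {
  if (size <= 0) throw new Error("size must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Cosine similarity between two vectors of equal dimension.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors have no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical minimal view of a stored cluster.
interface EmbeddedCluster {
  id: string;
  embedding: number[];
}

// Return the k clusters most similar to the query vector, best first.
function nearestClusters(
  query: number[],
  clusters: EmbeddedCluster[],
  k: number,
): Array<{ id: string; score: number }> {
  return clusters
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A brute-force scan like this is only reasonable for small test sets; it is shown to pin down what "similar error lookup by vector distance" means, not as a replacement for an indexed query.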
1.3 Telemetry Ingestion for Context
- 1.3.1 Link telemetry to errors
- correlationId propagation across services
- 5-minute window: error → preceding telemetry events
- Session state reconstruction (what user was doing)
- 1.3.2 Enrich error context
- Device info (OS version, model, memory)
- App state (screen, feature flags, config)
- Recent API calls (network trace from diagnostics)
- Recent user actions (breadcrumb trail)
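The 5-minute correlation window from 1.3.1 could look something like the sketch below. The `TelemetryEvent` and `ErrorOccurrence` shapes and the `precedingEvents` helper are assumptions made for illustration; the real telemetry schema may differ.

```typescript
// Hypothetical minimal event shapes; timestamps are ISO 8601 strings.
interface TelemetryEvent {
  correlationId: string;
  timestamp: string;
  name: string;
}

interface ErrorOccurrence {
  correlationId: string;
  timestamp: string;
}

const FIVE_MINUTES_MS = 5 * 60 * 1000;

// Select telemetry that shares the error's correlationId and happened
// at most `windowMs` before the error, ordered oldest first.
function precedingEvents(
  error: ErrorOccurrence,
  events: TelemetryEvent[],
  windowMs = FIVE_MINUTES_MS,
): TelemetryEvent[] {
  const errorTime = Date.parse(error.timestamp);
  return events
    .filter((e) => e.correlationId === error.correlationId)
    .filter((e) => {
      const t = Date.parse(e.timestamp);
      return t <= errorTime && errorTime - t <= windowMs;
    })
    .sort((x, y) => Date.parse(x.timestamp) - Date.parse(y.timestamp));
}
```

The ordered result doubles as a breadcrumb trail: reading it top to bottom reconstructs what the session was doing just before the failure.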
Phase 1 Exit Criteria:
- Errors auto-clustered with 90%+ accuracy
- Vector search returns semantically similar errors
- 10,000+ historical errors embedded and clustered
- Correlation pipeline links errors to telemetry context
Phase 2: LLM Analysis Engine (Week 1–2)
2.1 Prompt Engineering & Analysis Pipeline
- 2.1.1 Create analysis prompts
- ROOT_CAUSE_ANALYSIS prompt template:

    Given this error cluster:
    - Error signature: {fingerprint}
    - Sample stack traces: {samples}
    - Common context: {deviceStats}, {appState}
    - Preceding events: {breadcrumbSummary}
    - Similar resolved issues: {relatedClusters}

    Analyze and provide:
    1. Likely root cause category (config, dependency, logic, resource, external)
    2. Specific hypothesis with reasoning
    3. Evidence confidence (high/medium/low)
    4. Suggested investigation steps
    5. Potential fix direction

- PATTERN_SUMMARY prompt for cluster descriptions
- COMPARATIVE_ANALYSIS for error vs. baseline
- 2.1.2 LLM integration
- Azure OpenAI GPT-4o-mini for analysis (cost-effective)
- GPT-4o for complex multi-factor analysis
- Response JSON schema enforcement
- Retry logic with exponential backoff
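The sketch below illustrates three pieces of 2.1 under stated assumptions: filling `{placeholder}` slots in a prompt template, retrying a call with exponential backoff, and validating the model's JSON reply. All names (`renderPrompt`, `backoffDelayMs`, `withRetry`, `parseAnalysis`) and the delay values are hypothetical. The roadmap specifies Zod for schemas; the hand-written checks here are a stand-in so the example has no dependencies.

```typescript
// Substitute {name} placeholders; fail loudly if a variable is missing.
function renderPrompt(template: string, vars: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (_match, key: string) => {
    if (!(key in vars)) throw new Error(`Missing prompt variable: ${key}`);
    return vars[key];
  });
}

// Delay doubles per attempt and is capped; base and cap are assumed values.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Retry an async operation, sleeping between failed attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}

const CATEGORIES = ["config", "dependency", "logic", "resource", "external"];
const CONFIDENCES = ["high", "medium", "low"];

interface RootCauseAnalysis {
  rootCauseCategory: string;
  hypothesis: string;
  confidence: string;
}

// Parse the model's JSON reply and reject anything outside the expected shape.
function parseAnalysis(raw: string): RootCauseAnalysis {
  const data = JSON.parse(raw) as Partial<RootCauseAnalysis>;
  if (typeof data.rootCauseCategory !== "string" || !CATEGORIES.includes(data.rootCauseCategory)) {
    throw new Error("Invalid rootCauseCategory");
  }
  if (typeof data.hypothesis !== "string" || data.hypothesis.length === 0) {
    throw new Error("Invalid hypothesis");
  }
  if (typeof data.confidence !== "string" || !CONFIDENCES.includes(data.confidence)) {
    throw new Error("Invalid confidence");
  }
  return {
    rootCauseCategory: data.rootCauseCategory,
    hypothesis: data.hypothesis,
    confidence: data.confidence,
  };
}
```

A caller would wrap the actual LLM request in `withRetry` and pass its text output to `parseAnalysis`, so malformed replies surface as errors rather than as stored insights.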
2.2 Insight Generation Service
- 2.2.1 Create modules/ai-diagnostics/analyzer.ts
- analyzeCluster(clusterId): full analysis workflow
- generateInsight(errorContext): single error analysis
- compareClusters(clusterA, clusterB): diff analysis
- 2.2.2 Analysis workflow
- Fetch cluster data + related telemetry
- Build LLM context (respect token limits)
- Call LLM with structured prompt
- Parse and validate response
- Store insight in diagnostic_insights
- 2.2.3 Confidence scoring
- Evidence count weighting
- Similar resolved issue bonus
- Recency decay (older patterns = lower confidence)
- Multi-model consensus (if available)
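One way the confidence factors in 2.2.3 might combine is sketched below. Every weight here (the saturation constant, the 0.15 bonus, the 30-day half-life, the label thresholds) is an assumption chosen for illustration, not a value from the implementation, and multi-model consensus is left out.

```typescript
interface ConfidenceInputs {
  evidenceCount: number; // pieces of supporting evidence
  similarResolvedCount: number; // similar clusters that were already resolved
  daysSinceLastSeen: number; // age of the pattern
}

// Combine the factors into a score in [0, 1].
function confidenceScore(inputs: ConfidenceInputs, halfLifeDays = 30): number {
  // Evidence weighting saturates: each extra item adds less than the previous one.
  const evidence = 1 - Math.exp(-inputs.evidenceCount / 3);
  // Flat bonus when at least one similar issue has a known resolution.
  const bonus = inputs.similarResolvedCount > 0 ? 0.15 : 0;
  // Recency decay: the score halves every `halfLifeDays`.
  const decay = Math.pow(0.5, inputs.daysSinceLastSeen / halfLifeDays);
  return Math.min(1, Math.max(0, (evidence + bonus) * decay));
}

// Map the numeric score to the coarse label stored alongside it.
function confidenceLabel(score: number): "high" | "medium" | "low" {
  if (score >= 0.7) return "high";
  if (score >= 0.4) return "medium";
  return "low";
}
```

The feedback loop in 2.3 would be the natural place to tune these constants against actual resolution outcomes.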
2.3 Continuous Learning
- 2.3.1 Feedback loop
- Engineer feedback: "Was this insight helpful? 👍/👎"
- Resolution tracking (link commits to clusters)
- Confidence recalibration based on outcomes
- 2.3.2 Pattern accumulation
- "Known issues" database (manually curated)
- Historical fix patterns (what solved similar issues)
- Regression detection (old issue reappearing)
Phase 2 Exit Criteria:
- LLM generates root cause hypotheses with evidence
- Confidence scores align with actual resolution rates
- Analysis completes in < 5 seconds for typical clusters
- Feedback loop capturing engineer ratings
Phase 3: Natural Language Query Interface (Week 2)
3.1 Query Understanding
- 3.1.1 Create modules/ai-diagnostics/query-parser.ts
- Intent classification (root cause, pattern search, comparison, trend)
- Entity extraction (product, time range, error type, user segment)
- Temporal parsing ("yesterday", "last week", "since v2.1")
- Constraint identification ("only iOS", "excluding beta users")
- 3.1.2 Query patterns
- Root cause: "Why did X happen?" → analyze cluster
- Pattern search: "Show me similar crashes" → vector search
- Comparison: "Did error rate increase after release?" → trend analysis
- User impact: "How many users affected by Y?" → aggregation query
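As a rough illustration of 3.1, intent classification and temporal parsing can start from simple rules before any LLM is involved. The rule set, the `classifyIntent` and `parseTimeRange` names, and the reading of "last week" as the trailing seven UTC days are all assumptions of this sketch; version-relative phrases such as "since v2.1" are not handled.

```typescript
type Intent = "root_cause" | "pattern_search" | "comparison" | "impact" | "unknown";

// First matching rule wins; patterns mirror the example queries in 3.1.2.
const INTENT_RULES: Array<[Intent, RegExp]> = [
  ["root_cause", /^\s*why\b/i],
  ["pattern_search", /\bsimilar\b/i],
  ["impact", /\bhow many users\b/i],
  ["comparison", /\b(increase|decrease|compare|after release)\b/i],
];

function classifyIntent(query: string): Intent {
  for (const [intent, pattern] of INTENT_RULES) {
    if (pattern.test(query)) return intent;
  }
  return "unknown";
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Resolve relative phrases against `now`, using UTC day boundaries.
function parseTimeRange(
  query: string,
  now: Date,
): { start: string; end: string } | undefined {
  const startOfToday = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
  const iso = (ms: number) => new Date(ms).toISOString();
  if (/\byesterday\b/i.test(query)) {
    return { start: iso(startOfToday - DAY_MS), end: iso(startOfToday) };
  }
  if (/\blast week\b/i.test(query)) {
    return { start: iso(startOfToday - 7 * DAY_MS), end: iso(startOfToday) };
  }
  return undefined;
}
```

Passing `now` explicitly keeps the parser deterministic in tests; the extracted range maps directly onto the `timeRange` field of the parsed query.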
3.2 Query Execution Engine
- 3.2.1 Query → data pipeline
- Map entities to Cosmos queries
- Fetch relevant clusters, telemetry, sessions
- Assemble context for response generation
- 3.2.2 Response generation
- Direct answers for simple queries
- AI-generated summaries for complex analysis
- Data + visualization suggestions
- Drill-down links for exploration
3.3 REST API Routes
- 3.3.1 Create modules/ai-diagnostics/routes.ts
- POST /ai-diagnostics/query: natural language question
- GET /ai-diagnostics/clusters/:id/analysis: pre-computed insight
- POST /ai-diagnostics/clusters/:id/analyze: trigger fresh analysis
- GET /ai-diagnostics/suggestions: auto-suggested investigations
- POST /ai-diagnostics/feedback: submit insight rating
Phase 3 Exit Criteria:
- Natural language queries parse correctly (90%+ intent accuracy)
- Query → response pipeline < 3 seconds
- Complex queries return structured answers with evidence
- API routes tested and documented
Phase 4: Admin Dashboard UI (Week 2–3)
4.1 AI Insights Page
- 4.1.1 Create /ops/ai-diagnostics/page.tsx
- Smart search bar (natural language input)
- Suggested queries based on recent errors
- Recent AI-generated insights list
- Trending clusters (auto-detected anomalies)
- 4.1.2 Query results view
- AI-generated answer with confidence badge
- Supporting evidence cards (cluster stats, sample errors)
- Related debug sessions (linked traces)
- Timeline visualization of error pattern
- "Investigate further" actions
4.2 Cluster Detail with AI Analysis
- 4.2.1 Enhance error cluster detail
- AI-generated summary card ("This appears to be...")
- Root cause hypothesis with confidence
- Evidence breakdown (stack samples, device patterns, API failures)
- Suggested fixes from similar resolved issues
- "Request deeper analysis" button (GPT-4o)
- 4.2.2 Interactive investigation
- Compare with other clusters ("Show me similar issues")
- Filter by context (OS version, app version, feature flags)
- View affected user journeys (breadcrumb trails)
4.3 Proactive Alerts
- 4.3.1 Anomaly detection
- Auto-detect emerging error clusters
- Spike in existing cluster frequency
- New error types after releases
- 4.3.2 AI-generated alerts
- Slack/Teams notification with summary
- "Investigate in AI Diagnostics" deep link
- Auto-started debug session recommendations
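For the "spike in existing cluster frequency" check in 4.3.1, a z-score over recent counts is one simple option. This is a swapped-in illustrative technique, not necessarily what the alerting code does; the `isSpike` name, the threshold of 3, and the minimum sample count are assumptions.

```typescript
// Flag `current` as a spike when it sits more than `zThreshold` standard
// deviations above the mean of recent per-interval occurrence counts.
function isSpike(
  history: number[],
  current: number,
  zThreshold = 3,
  minSamples = 5,
): boolean {
  if (history.length < minSamples) return false; // not enough data to judge
  const mean = history.reduce((sum, n) => sum + n, 0) / history.length;
  const variance =
    history.reduce((sum, n) => sum + (n - mean) ** 2, 0) / history.length;
  const std = Math.sqrt(variance);
  if (std === 0) return current > mean; // flat history: any increase stands out
  return (current - mean) / std > zThreshold;
}
```

With hourly counts of `[10, 12, 11, 9, 10, 11]`, a new count of 40 is flagged while 12 is not, which is the kind of signal that would feed the Slack/Teams notification.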
Phase 4 Exit Criteria:
- Admin can ask questions and get AI-generated answers
- Cluster detail shows AI analysis with evidence
- Proactive alerts for emerging issues
- Full test coverage (UI + API)
Phase 5: Advanced Capabilities (Future)
5.1 Multi-Modal Analysis
- Analyze screenshots from debug sessions for UI issues
- Voice transcription analysis (for voice app errors)
- Performance trace visualization with AI annotations
5.2 Predictive Diagnostics
- Pre-crash pattern detection (warn before crash happens)
- Resource exhaustion prediction (memory, disk, API quotas)
- Config drift detection ("this setting combination often fails")
5.3 Self-Healing Suggestions
- Auto-generated config recommendations
- Feature flag rollback suggestions
- Circuit breaker threshold recommendations
Implementation Tracking
| Phase | Task | Status | Commit |
|---|---|---|---|
| 1.1 | Error clustering types | ✅ | 4de0126 |
| 1.1 | Cosmos containers | ✅ | 4de0126 |
| 1.1 | Error normalization | ✅ | 8cdddd7 |
| 1.2 | Embedding pipeline | ✅ | 50b7e22 |
| 1.2 | Vector search setup | ✅ | 6b97476 |
| 1.2 | Clustering algorithm | ✅ | 8951ab2 |
| 1.3 | Telemetry linking | ✅ | 1ff0293 |
| 1.3 | Error context enrichment | ✅ | 1ff0293 |
| 2.1 | Analysis prompts | ✅ | 97b3ffb |
| 2.1 | LLM integration | ✅ | 97b3ffb |
| 2.2 | Insight generation service | ✅ | 97b3ffb |
| 2.2 | Analysis workflow | ✅ | 97b3ffb |
| 2.2 | Confidence scoring | ✅ | 97b3ffb |
| 2.3 | Feedback loop | ✅ | 97b3ffb |
| 2.3 | Pattern accumulation | ✅ | 97b3ffb |
| 3.1 | Query parser | ✅ | 71cbb57 |
| 3.1 | Query patterns | ✅ | 71cbb57 |
| 3.2 | Query execution | ✅ | 71cbb57 |
| 3.2 | Response generation | ✅ | 71cbb57 |
| 3.3 | REST API routes | ✅ | 1cba699 |
| 4.1 | AI insights page | ✅ | 460ed6d |
| 4.1 | Query results view | ✅ | 460ed6d |
| 4.2 | Cluster detail | ✅ | 460ed6d |
| 4.2 | Interactive investigation | ✅ | eab8543 |
| 4.3 | Proactive alerts | ✅ | eab8543 |
Legend: ⬜ Not started | 🟡 In progress | ✅ Complete | ⏸️ Deferred
Quick Reference for Implementing Agent
📋 Full Roadmap: /Users/sd9235/code/mygh/learning_ai_common_plat/docs/roadmaps/AI_DIAGNOSTIC_ASSISTANT_ROADMAP.md
Key Files to Modify/Create:
services/platform-service/
├── src/
│   ├── modules/ai-diagnostics/
│   │   ├── types.ts                  # [1.1.1] Error clustering types
│   │   ├── repository.ts             # [1.2] Data access layer
│   │   ├── analyzer.ts               # [2.2] LLM analysis engine
│   │   ├── query-parser.ts           # [3.1] NL query understanding
│   │   ├── query-executor.ts         # [3.2] Query execution
│   │   ├── routes.ts                 # [3.3] REST API
│   │   └── ai-diagnostics.test.ts    # Tests
│   ├── lib/
│   │   ├── cosmos-init.ts            # [1.1.2] Add containers
│   │   ├── embedding-client.ts       # [1.2.1] Azure OpenAI embeddings
│   │   └── pii-redaction.ts          # Reuse existing
│   └── server.ts                     # [3.3] Register routes
dashboards/admin-web/
├── src/
│   ├── app/(dashboard)/
│   │   ├── ai-diagnostics/
│   │   │   ├── page.tsx              # [4.1] Main insights page
│   │   │   └── [id]/
│   │   │       └── page.tsx          # [4.2] Cluster detail
│   ├── lib/
│   │   └── ai-diagnostics-client.ts  # API client
│   └── components/
│       └── ai-diagnostics/           # Reusable components
Commit Message Format:
feat(ai-diagnostics): <description> [<task.code>]
Example:
git add services/platform-service/src/modules/ai-diagnostics/
git commit -m "feat(ai-diagnostics): add error clustering types and cosmos containers [1.1.1-1.1.2]"
Testing Requirements:
- Unit tests: 20+ Vitest tests for clustering, embeddings, LLM responses
- Integration tests: End-to-end query → analysis pipeline
Dependencies:
- Telemetry module (error events)
- Azure OpenAI (embeddings + GPT-4o)
- Existing diagnostics module (optional linking)
ErrorClusterDoc
interface ErrorClusterDoc {
  id: string; // ec_<uuid>
  productId: string; // partition key
  fingerprintHash: string; // SHA-256 of normalized error

  // Cluster metadata
  firstSeenAt: string; // ISO 8601
  lastSeenAt: string;
  occurrenceCount: number; // Total occurrences
  uniqueUsers: number; // Affected user count

  // Error signature
  errorType: string; // Exception class/name
  messageTemplate: string; // Normalized message with placeholders
  stackSignature: string; // Normalized stack frames

  // Vector embedding for semantic search
  embedding: number[]; // 1536-dim from text-embedding-3-small
  embeddingVersion: string; // Model version for re-embedding

  // Context patterns (auto-extracted)
  commonContext: {
    osVersions: Array<{ version: string; count: number }>;
    appVersions: Array<{ version: string; count: number }>;
    deviceModels: Array<{ model: string; count: number }>;
    screenContexts: Array<{ screen: string; count: number }>;
  };

  // Related data
  relatedClusterIds: string[]; // Similar clusters (vector similarity)
  mergedIntoClusterId?: string; // If deduplicated

  // Resolution tracking
  status: 'active' | 'investigating' | 'resolved' | 'ignored';
  resolvedAt?: string;
  resolutionCommit?: string; // Link to fix

  // Timestamps
  createdAt: string;
  updatedAt: string;
  ttl: number; // 90 days
}
DiagnosticInsightDoc
interface DiagnosticInsightDoc {
  id: string; // di_<uuid>
  clusterId: string; // partition key (with productId)
  productId: string;

  // AI-generated analysis
  analysisType: 'root_cause' | 'pattern' | 'comparison' | 'trend';
  generatedAt: string;

  // LLM output
  rootCauseCategory: 'config' | 'dependency' | 'logic' | 'resource' | 'external' | 'unknown';
  hypothesis: string; // Natural language explanation
  reasoning: string; // Why LLM thinks this
  confidence: 'high' | 'medium' | 'low';
  confidenceScore: number; // 0.0–1.0

  // Evidence
  evidence: Array<{
    type:
      | 'stack_trace'
      | 'telemetry_pattern'
      | 'device_correlation'
      | 'api_failure'
      | 'similar_issue';
    description: string;
    strength: 'strong' | 'moderate' | 'weak';
    data: Record<string, unknown>;
  }>;

  // Suggested actions
  suggestedInvestigation: string[];
  potentialFixDirection?: string;
  similarResolvedIssues?: Array<{
    clusterId: string;
    resolution: string;
    confidence: number;
  }>;

  // Feedback
  feedbackStats: {
    helpful: number;
    notHelpful: number;
    engineerNotes: string[];
  };

  // LLM metadata
  modelUsed: string; // gpt-4o, gpt-4o-mini
  promptTokens: number;
  completionTokens: number;
  createdAt: string;
  ttl: number; // 90 days
}
NaturalLanguageQueryDoc
interface NaturalLanguageQueryDoc {
  id: string; // nq_<uuid>
  userId: string; // Admin who asked
  productId?: string; // Optional filter

  // Query
  rawQuery: string; // "Why did iOS keyboard crash yesterday?"
  parsedIntent: 'root_cause' | 'pattern_search' | 'comparison' | 'trend' | 'impact';
  extractedEntities: {
    products?: string[];
    timeRange?: { start: string; end: string };
    errorTypes?: string[];
    platforms?: string[];
    userSegments?: string[];
  };

  // Execution
  executedQuery: string; // Translated Cosmos query
  dataSources: string[]; // Clusters, telemetry, sessions accessed
  executionTimeMs: number;

  // Response
  aiResponse: string; // Generated answer
  confidence: number; // Overall confidence
  supportingData: Array<{
    type: 'cluster' | 'telemetry' | 'session';
    id: string;
    relevanceScore: number;
  }>;

  // Feedback
  userRating?: 'helpful' | 'not_helpful';
  userComment?: string;
  createdAt: string;
  ttl: number; // 30 days
}
Appendix B: API Reference
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | /ai-diagnostics/query | Admin | Natural language diagnostic query |
| GET | /ai-diagnostics/clusters | Admin | List error clusters (with AI summaries) |
| GET | /ai-diagnostics/clusters/:id | Admin | Cluster detail with AI analysis |
| POST | /ai-diagnostics/clusters/:id/analyze | Admin | Trigger fresh LLM analysis |
| GET | /ai-diagnostics/clusters/:id/analysis | Admin | Get pre-computed insight |
| GET | /ai-diagnostics/suggestions | Admin | AI-suggested investigations |
| POST | /ai-diagnostics/feedback | Admin | Rate insight helpfulness |
| POST | /ai-diagnostics/search | Admin | Semantic search across errors |
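A client call against the query endpoint might be assembled as in the sketch below. The request body fields (`query`, `productId`), the bearer-token header, and the `buildQueryRequest` helper are assumptions; this document does not specify the request schema or how admin auth is transmitted.

```typescript
// Hypothetical request body for POST /ai-diagnostics/query.
interface DiagnosticQueryBody {
  query: string;
  productId?: string;
}

interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build the URL and fetch options without performing any network I/O.
function buildQueryRequest(
  baseUrl: string,
  body: DiagnosticQueryBody,
  adminToken: string,
): PreparedRequest {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/ai-diagnostics/query`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${adminToken}`, // assumed auth scheme
      },
      body: JSON.stringify(body),
    },
  };
}
```

The prepared request can then be passed to `fetch(url, init)`; keeping construction separate from sending makes the client easy to unit test.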
Appendix C: Integration Points
With Telemetry Module
- Error events auto-create/update clusters
- Telemetry context enriches error analysis
- Correlation IDs link errors to user journeys
With Diagnostics Module
- Debug sessions linked to error clusters
- Screenshots from sessions aid visual analysis
- Network traces provide API failure context
With Event Bus
| Event | Action |
|---|---|
| telemetry.error.ingested | Update/create cluster, trigger re-analysis if new pattern |
| diagnostics.session.completed | Link session to related clusters, analyze captured logs |
| diagnostics.ingest.fatal | High-priority cluster analysis, alert if novel pattern |
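The event-to-action mapping above lends itself to a lookup table. In this sketch the event names come from the table, while the action identifiers, the priority field, and `planAction` are hypothetical labels introduced for illustration.

```typescript
interface PlannedAction {
  action: "upsert_cluster" | "link_session" | "priority_analysis";
  priority: "normal" | "high";
}

// One entry per event bus topic the module subscribes to.
const EVENT_ACTIONS = new Map<string, PlannedAction>([
  ["telemetry.error.ingested", { action: "upsert_cluster", priority: "normal" }],
  ["diagnostics.session.completed", { action: "link_session", priority: "normal" }],
  ["diagnostics.ingest.fatal", { action: "priority_analysis", priority: "high" }],
]);

// Returns undefined for events this module does not handle.
function planAction(eventName: string): PlannedAction | undefined {
  return EVENT_ACTIONS.get(eventName);
}
```

Unknown events fall through as `undefined`, so a subscriber can ignore topics it does not own without special-casing them.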
Appendix D: Cost Estimation
| Component | Monthly Cost (est.) |
|---|---|
| Azure OpenAI embeddings | $50–100 (10K errors/day) |
| GPT-4o-mini analysis | $100–200 (1K analyses/day) |
| GPT-4o deep analysis | $50–100 (100 deep analyses/day) |
| Cosmos DB vector storage | $20–50 |
| Total | $220–450/month |
Optimization:
- Cache frequent cluster analyses (24hr TTL)
- Use GPT-4o-mini for 90% of queries
- Batch embedding jobs during off-peak
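The 24-hour analysis cache could be as small as the sketch below. `TtlCache` is a hypothetical in-memory helper with an injectable clock; a real deployment might instead rely on a shared cache or on the stored insight's own TTL.

```typescript
const DAY_IN_MS = 24 * 60 * 60 * 1000;

// Minimal in-memory cache whose entries expire after a fixed time-to-live.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number = DAY_IN_MS, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Expired entries are evicted lazily on read.
  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
}
```

Keyed by cluster ID, a cache like this lets repeated requests for the same cluster skip the LLM call until the entry ages out.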
Current Status
- Design complete — 2026-03-03
- Phase 1: Data Pipeline — Complete
- Phase 2: LLM Engine — Complete
- Phase 3: Query Interface — Complete
- Phase 4: Admin UI — Complete
- Phase 5: Advanced Capabilities — Future
Estimated Timeline: COMPLETE (Phases 1–4)
Dependencies:
- Telemetry module (must be collecting errors)
- Diagnostics module (optional, for rich context)
- Azure OpenAI deployment (embedding + GPT-4o access)
Last Updated: 2026-03-03