# MCP + A2A — Implementation Plan (Execution-Ready)

## Objective

Deliver a **safe, auditable, product-aware** MCP + A2A capability layer on top of existing ByteLyst services (primarily `platform-service` and `extraction-service`) so that agents can:

- diagnose incidents (telemetry + diagnostics)
- manage telemetry policies / rollouts
- orchestrate remote diagnostics sessions
- iterate on extraction tasks/prompts with eval loops
- run repeatable ops workflows (jobs, flags, maintenance)

This plan intentionally starts with a **minimal P0 slice** and expands in phases.

## Guiding constraints (must-haves)

- **Product isolation**
  - Every tool call is scoped to an explicit `productId`.
- **Auditability**
  - Every mutating tool call must produce an audit record (directly or via existing APIs).
- **Least privilege**
  - Query tools available to viewer roles; mutating tools gated to admin/super_admin.
- **Safety defaults**
  - Mutations support `dryRun` where feasible.
  - Diagnostic amplification (policies/sessions) must require an `expiresAt`.
- **No new “shadow APIs”**
  - Prefer calling existing service endpoints; only add endpoints if the tool surface cannot be expressed otherwise.

## Phase 0 — Baseline readiness (1–2 days)

### Deliverables

- A “tool surface” inventory mapped to existing REST endpoints and required headers.
- A role matrix for tool authorization.
- A request-id propagation and logging standard for MCP tool calls.

### Phase 0 checklist (definition of done)

- Confirm whether MCP tool implementations will call **REST only** (preferred) or allow **direct Cosmos reads** for selected query paths.
- Choose whether MCP is:
  - a new service/package in `learning_ai_common_plat`, or
  - colocated under `services/platform-service`.
- Define the initial auth strategy:
  - JWT only, or
  - JWT + API tokens for automation.
- Define tool-level authorization rules (viewer/admin/super_admin) and how they’re enforced.
- Agree on a consistent product scoping rule:
  - `productId` is mandatory input for every tool call and sent as `x-product-id` downstream.

### Decisions to lock

- **MCP server shape**
  - Single server with namespaces (`platform.*`, `extraction.*`) vs. two servers.
- **Auth mechanism**
  - Primary: platform-service JWT.
  - Secondary: platform API tokens for trusted automation.

### Required invariants

- `x-request-id` propagated on all downstream calls.
- `x-product-id` required and validated for all calls.

## Phase 1 — MVP MCP server (P0 slice) (3–7 days)

### Goal

Enable a Support/Ops agent to answer: _“What’s happening?”_ and _“Start a bounded diagnostics session.”_

### Tool surface (MVP)

#### Read-only (viewer)

- `telemetry.queryEvents(filters)`
- `telemetry.listClusters(filters)`
- `telemetry.listPolicies()`
- `telemetry.getMetrics()`
- `diagnostics.getSession(sessionId)`
- `diagnostics.listSessions(filters)`
- `diagnostics.getLogs(sessionId, filters)`
- `diagnostics.getTraces(sessionId, filters)`
- `extraction.sidecarHealth()`

### Phase 1 prerequisites (to avoid hidden integration failures)

- Align `@bytelyst/diagnostics-client` ingest endpoint with `platform-service`.
  - Today the service ingests via `POST /api/diagnostics/sessions/:id/logs|traces` (and session-scoped screenshot upload), while the client posts to `POST /api/diagnostics/ingest`.
  - Decision: update the client to use the session-scoped endpoints (no backwards-compat alias endpoint).
- Decide whether `@bytelyst/telemetry-client` should become policy-aware by consuming `GET /api/telemetry/config`.
  - If yes, treat it as a Phase 1 deliverable (with caching + ETag). Otherwise, explicitly defer to Phase 2.

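
As a rough illustration of the session-scoped ingest decision combined with the required invariants (`x-request-id` and `x-product-id` on every downstream call), a request builder might look like the sketch below. The `ToolCallContext` and `DownstreamRequest` shapes and the `buildIngestRequest` name are hypothetical and not part of `@bytelyst/diagnostics-client`; only the endpoint path and the two header names come from this plan.

```typescript
// Hypothetical shapes for illustration only; not the actual client API.
type IngestKind = "logs" | "traces";

interface ToolCallContext {
  productId: string; // mandatory on every tool call
  requestId: string; // taken from the incoming MCP call and passed through unchanged
}

interface DownstreamRequest {
  method: "POST";
  path: string;
  headers: Record<string, string>;
}

// Builds a request for the session-scoped ingest endpoints
// (POST /api/diagnostics/sessions/:id/logs|traces) and refuses to
// build one if either invariant header would be empty.
function buildIngestRequest(
  ctx: ToolCallContext,
  sessionId: string,
  kind: IngestKind,
): DownstreamRequest {
  if (ctx.productId.trim() === "") throw new Error("productId is required");
  if (ctx.requestId.trim() === "") throw new Error("requestId is required");
  if (sessionId.trim() === "") throw new Error("sessionId is required");

  return {
    method: "POST",
    path: `/api/diagnostics/sessions/${encodeURIComponent(sessionId)}/${kind}`,
    headers: {
      "x-request-id": ctx.requestId,
      "x-product-id": ctx.productId,
    },
  };
}
```

Failing fast in the builder, rather than letting the downstream service reject the call, keeps a missing `productId` from ever leaving the MCP layer.
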
#### Mutating (admin)

- `telemetry.previewPolicy(targeting)`
- `telemetry.createPolicy(input)`
  - requires `expiresAt`
- `telemetry.updatePolicy(id, updates)`
- `telemetry.updateClusterStatus(clusterId, pk, status)`
- `diagnostics.createSession(target, config)`
  - requires `expiresAt` or `expiresInMinutes`
- `diagnostics.updateSession(sessionId, updates)`
- `diagnostics.cancelSession(sessionId)`

#### Compound workflow tool (admin)

- `support.createDebugPack(input)`
  - internally calls:
    - telemetry queries
    - optional diagnostics session create + polling
  - returns a single structured artifact:
    - `debugPackId`, `clusterRefs`, `sessionRefs`, and a markdown summary

### Output contracts (schemas)

Define explicit JSON schemas (Zod or equivalent) for:

- `TelemetryFilters`
- `TelemetryPolicyInput` (requires expiry)
- `DiagnosticsSessionTarget`
- `DiagnosticsSessionConfig`
- `DebugPackRequest`
- `DebugPackResponse`

### Guardrails

- Query limits:
  - default `limit` and max `limit` enforced at MCP layer.
- Policy guardrails:
  - require `expiresAt`
  - require explicit `eventTypes/modules`
  - block wildcard collection unless super_admin
- Diagnostics guardrails:
  - enforce max duration
  - enforce max capture volume per flush

### Acceptance criteria

- A single agent can:
  - pull clusters for a product + time window
  - propose a telemetry policy and preview targeting
  - create an expiring policy
  - start a diagnostics session and retrieve data
  - generate a “Debug Pack” artifact

### Phase 1 engineering checklist (definition of done)

- MCP layer enforces:
  - request-id propagation (`x-request-id`)
  - required product scoping (`productId`)
  - default query caps and maximum caps
  - expiry requirements for policies/sessions
- Every mutating tool call produces an audit record (either:
  - by calling existing audit endpoints, or
  - by ensuring the underlying platform-service endpoint already records audit)
- Tool names and inputs are documented and stable (no breaking renames during Phase 1)

## Phase 2 — A2A orchestration (1–2 weeks)

### Goal

Turn multi-step support/ops workflows into **repeatable agent playbooks** with explicit handoffs.

### Standard agents

- **DispatcherAgent**
- **TelemetryAnalystAgent**
- **DiagnosticsOrchestratorAgent**
- **OpsExecutorAgent**
- **ReportWriterAgent**

### Handoff artifacts

- `SupportIncidentBrief`
- `TelemetryFindings`
- `DiagnosticsSessionPlan`
- `OpsChangePlan`
- `FinalIncidentReport`

### Acceptance criteria

- “Support Debug Pack” runs end-to-end via:
  - Dispatcher → Telemetry Analyst → Diagnostics Orchestrator → Report Writer
- Every handoff is persisted (even if only in logs initially) with stable IDs.

## Phase 3 — Extraction task iteration loop (1–3 weeks)

### Goal

Make extraction prompt/task improvements safe, testable, and regression-resistant.

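
As a sketch of what "regression-resistant" could mean in practice, the snippet below scores a batch of extraction outputs on the two quality metrics named in this phase's acceptance criteria (schema validity and required-field coverage) and flags any metric that falls below a stored baseline. The plan does not define an eval result shape, so the types, the convention that `null` marks a schema-invalid output, and the `tolerance` default are all assumptions made for illustration.

```typescript
// Hypothetical types: the plan does not specify an eval result format.
// A null entry stands for an output that failed schema validation.
type Extraction = Record<string, unknown> | null;

interface QualityMetrics {
  schemaValidity: number;        // fraction of outputs that passed validation
  requiredFieldCoverage: number; // fraction of required fields filled across valid outputs
}

function computeMetrics(outputs: Extraction[], requiredFields: string[]): QualityMetrics {
  const valid = outputs.filter((o): o is Record<string, unknown> => o !== null);
  const schemaValidity = outputs.length === 0 ? 0 : valid.length / outputs.length;

  // Count how many (output, required field) slots hold a non-empty value.
  const slots = valid.length * requiredFields.length;
  let filled = 0;
  for (const output of valid) {
    for (const field of requiredFields) {
      const value = output[field];
      if (value !== undefined && value !== null && value !== "") filled += 1;
    }
  }
  const requiredFieldCoverage = slots === 0 ? 0 : filled / slots;

  return { schemaValidity, requiredFieldCoverage };
}

// Returns the names of metrics where the candidate is worse than the
// baseline by more than `tolerance` (an assumed default, not from the plan).
function findRegressions(
  baseline: QualityMetrics,
  candidate: QualityMetrics,
  tolerance = 0.01,
): string[] {
  const keys: (keyof QualityMetrics)[] = ["schemaValidity", "requiredFieldCoverage"];
  return keys.filter((key) => candidate[key] < baseline[key] - tolerance);
}
```

An empty result from `findRegressions` would be the signal that a prompt edit is safe to publish; a non-empty one names the metric to investigate first.
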
### MCP tools

- `extraction.extract(text, taskId?, modelId?)`
- `extraction.extractBatch(inputs)`
- `extraction.submitJob(inputs, webhookUrl?)`
- `extraction.getJob(jobId)`
- `extraction.metrics()` / `extraction.cacheStats()`

### A2A workflow

- **TaskDesignerAgent** drafts task prompt + examples
- **EvalRunnerAgent** runs batch eval sets
- **RegressionAgent** compares to baseline
- **PublisherAgent** updates task registry + rollout

### Acceptance criteria

- A single command/workflow can:
  - run eval suite
  - compute simple quality metrics (schema validity, required fields coverage)
  - produce a report and recommended next edit

## Phase 4 — Ops expansion (jobs/flags/maintenance/webhooks) (1–2 weeks)

### Tools

- `jobs.list`, `jobs.trigger`, `jobs.listRuns`
- `flags.list`, `flags.upsert`, `flags.evaluate`
- `maintenance.get`, `maintenance.set`
- `webhooks.listSubscriptions`, `webhooks.test`, `webhooks.rotateSecret`

### Acceptance criteria

- “Ops Copilot” can safely execute a bounded change plan:
  - propose change
  - dry-run if supported
  - execute with audit
  - verify outcome

## Security & privacy checklist

- Explicit `productId` on every tool call
- Avoid returning raw PII in tool results
- Ensure diagnostics redaction remains enforced server-side
- Enforce expirations on policies and sessions
- Rate limit MCP server endpoints

## Rollout strategy

- Start with internal-only usage (super_admin).
- Add admin roles once guardrails are proven.
- Add viewer read-only access for broader teams.
- Add product-specific namespaces only after platform namespaces stabilize.

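
Several of the checklist items above (explicit `productId`, role gating, enforced expirations) could plausibly be covered by one pre-flight check at the MCP layer. The sketch below is one possible shape for it. The tool names and the `expiresAt` / `expiresInMinutes` rule come from this plan, but the `MIN_ROLE` mapping, the `ToolCall` and `Decision` types, and the `authorize` function are hypothetical, and the real role matrix remains a Phase 0 deliverable.

```typescript
type Role = "viewer" | "admin" | "super_admin";

const ROLE_RANK: Record<Role, number> = { viewer: 0, admin: 1, super_admin: 2 };

// Assumed minimum role per tool (a small subset, for illustration).
const MIN_ROLE: Record<string, Role> = {
  "telemetry.queryEvents": "viewer",
  "diagnostics.getSession": "viewer",
  "telemetry.createPolicy": "admin",
  "diagnostics.createSession": "admin",
};

// Tools whose input must carry an expiry, per the guardrails in this plan.
const REQUIRES_EXPIRY = new Set(["telemetry.createPolicy", "diagnostics.createSession"]);

interface ToolCall {
  tool: string;
  role: Role;
  productId?: string;
  expiresAt?: string;        // ISO 8601 timestamp
  expiresInMinutes?: number; // accepted only by diagnostics.createSession
}

type Decision = { allowed: true } | { allowed: false; reason: string };

function hasFutureExpiry(call: ToolCall, now: Date): boolean {
  if (call.expiresAt !== undefined) {
    const expiry = Date.parse(call.expiresAt);
    return !Number.isNaN(expiry) && expiry > now.getTime();
  }
  return call.tool === "diagnostics.createSession" && (call.expiresInMinutes ?? 0) > 0;
}

function authorize(call: ToolCall, now: Date = new Date()): Decision {
  if (!call.productId) return { allowed: false, reason: "productId is required" };

  const minRole = MIN_ROLE[call.tool];
  if (minRole === undefined) return { allowed: false, reason: "unknown tool" };

  if (ROLE_RANK[call.role] < ROLE_RANK[minRole]) {
    return { allowed: false, reason: `requires ${minRole}` };
  }
  if (REQUIRES_EXPIRY.has(call.tool) && !hasFutureExpiry(call, now)) {
    return { allowed: false, reason: "a future expiry is required" };
  }
  return { allowed: true };
}
```

Under this shape, the staged rollout amounts to editing `MIN_ROLE`: setting every entry to `super_admin` would reproduce the internal-only first stage, and later stages relax individual entries.
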
## Work breakdown (suggested)

- **Milestone A**: MVP MCP server + telemetry/diagnostics read-only
- **Milestone B**: mutating tools + dry-run/expiry enforcement
- **Milestone C**: `support.createDebugPack` compound tool
- **Milestone D**: A2A runner + handoff schemas
- **Milestone E**: extraction eval loop

## Open questions (need decisions)

- Should the MCP server call services via:
  - service REST endpoints only, or
  - direct Cosmos reads for some query paths?
- Where should A2A handoff artifacts be stored:
  - telemetry events,
  - a dedicated Cosmos container,
  - or both?
- Do we want one MCP server repo/package, or colocated under `platform-service`?