# MCP + A2A — Implementation Plan (Execution-Ready)

## Objective

Deliver a **safe, auditable, product-aware** MCP (Model Context Protocol) + A2A (agent-to-agent) capability layer on top of existing ByteLyst services (primarily `platform-service` and `extraction-service`) so that agents can:

- diagnose incidents (telemetry + diagnostics)
- manage telemetry policies / rollouts
- orchestrate remote diagnostics sessions
- iterate on extraction tasks/prompts with eval loops
- run repeatable ops workflows (jobs, flags, maintenance)

This plan intentionally starts with a **minimal P0 slice** and expands in phases.

## Guiding constraints (must-haves)

- **Product isolation**
  - Every tool call is scoped to an explicit `productId`.
- **Auditability**
  - Every mutating tool call must produce an audit record (directly or via existing APIs).
- **Least privilege**
  - Query tools are available to viewer roles; mutating tools are gated to admin/super_admin.
- **Safety defaults**
  - Mutations support `dryRun` where feasible.
  - Anything that amplifies data collection (telemetry policies, diagnostics sessions) must carry an explicit expiry (`expiresAt`, or `expiresInMinutes` for sessions).
- **No new “shadow APIs”**
  - Prefer calling existing service endpoints; only add endpoints if the tool surface cannot be expressed otherwise.

## Phase 0 — Baseline readiness (1–2 days)

### Deliverables

- A “tool surface” inventory mapped to existing REST endpoints and required headers.
- A role matrix for tool authorization.
- A request-id propagation and logging standard for MCP tool calls.

### Phase 0 checklist (definition of done)

- Confirm whether MCP tool implementations will call **REST only** (preferred) or allow **direct Cosmos reads** for selected query paths.
- Choose whether MCP is:
  - a new service/package in `learning_ai_common_plat`, or
  - colocated under `services/platform-service`.
- Define the initial auth strategy:
  - JWT only, or
  - JWT + API tokens for automation.
- Define tool-level authorization rules (viewer/admin/super_admin) and how they’re enforced.
- Agree on a consistent product scoping rule:
  - `productId` is mandatory input for every tool call and sent as `x-product-id` downstream.
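
The tool-level authorization rules in the checklist above could be captured as a small role matrix. The sketch below is one possible shape, not a settled design: the `TOOL_MIN_ROLE` entries and the `authorizeToolCall` helper are hypothetical names, and the assumption that roles form a strict hierarchy (viewer < admin < super_admin) still needs to be confirmed.

```typescript
// Roles named in this plan, assumed to be ordered from least to most privileged.
type Role = "viewer" | "admin" | "super_admin";

const ROLE_RANK: Record<Role, number> = { viewer: 0, admin: 1, super_admin: 2 };

// Hypothetical matrix: tool name -> minimum role required to call it.
// A Map avoids accidental hits on inherited object keys such as "constructor".
const TOOL_MIN_ROLE = new Map<string, Role>([
  ["telemetry.queryEvents", "viewer"],
  ["diagnostics.getLogs", "viewer"],
  ["telemetry.createPolicy", "admin"],
  ["diagnostics.createSession", "admin"],
  ["support.createDebugPack", "admin"],
]);

type AuthDecision = { allowed: true } | { allowed: false; reason: string };

// Deny by default: a tool that is missing from the matrix is never callable.
function authorizeToolCall(role: Role, tool: string): AuthDecision {
  const minRole = TOOL_MIN_ROLE.get(tool);
  if (minRole === undefined) {
    return { allowed: false, reason: `tool not in role matrix: ${tool}` };
  }
  if (ROLE_RANK[role] < ROLE_RANK[minRole]) {
    return { allowed: false, reason: `${tool} requires ${minRole}, caller is ${role}` };
  }
  return { allowed: true };
}

// Example decisions for a viewer calling a query tool and a mutating tool.
const viewerRead = authorizeToolCall("viewer", "telemetry.queryEvents");
const viewerWrite = authorizeToolCall("viewer", "telemetry.createPolicy");
```

Keeping the matrix as data (rather than scattered `if` checks) would make it easy to review alongside the tool surface inventory.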

### Decisions to lock

- **MCP server shape**
  - Single server with namespaces (`platform.*`, `extraction.*`) vs. two servers.
- **Auth mechanism**
  - Primary: platform-service JWT.
  - Secondary: platform API tokens for trusted automation.

### Required invariants

- `x-request-id` propagated on all downstream calls.
- `x-product-id` required and validated for all calls.
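
A minimal sketch of how these two invariants might be enforced in one place before any downstream call. The `ToolCallContext` shape, the `buildDownstreamHeaders` helper, and the `PRODUCT_ID_PATTERN` format rule are assumptions for illustration; the real `productId` format is not defined in this plan.

```typescript
interface ToolCallContext {
  productId: string;   // explicit input on every tool call
  requestId?: string;  // inbound x-request-id, if the caller supplied one
  bearerToken: string; // platform-service JWT or API token
}

// Hypothetical format rule, used only to reject obviously invalid values.
const PRODUCT_ID_PATTERN = /^[a-z0-9][a-z0-9_-]{1,63}$/;

function buildDownstreamHeaders(
  ctx: ToolCallContext,
  newRequestId: () => string, // injected so ID generation stays testable
): Record<string, string> {
  if (!PRODUCT_ID_PATTERN.test(ctx.productId)) {
    throw new Error("productId is required and must be a valid product identifier");
  }
  // Reuse the inbound request ID when present so logs correlate end to end.
  const inbound = ctx.requestId?.trim();
  const requestId = inbound ? inbound : newRequestId();
  return {
    "x-request-id": requestId,
    "x-product-id": ctx.productId,
    authorization: `Bearer ${ctx.bearerToken}`,
  };
}

const headers = buildDownstreamHeaders(
  { productId: "acme-notes", requestId: "req-123", bearerToken: "<jwt>" },
  () => "generated-id",
);
```

Routing every downstream call through a single helper like this would make a missing header a code-review problem rather than a runtime surprise.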

## Phase 1 — MVP MCP server (P0 slice) (3–7 days)

### Goal

Enable a Support/Ops agent to answer _“What’s happening?”_ and to act on _“Start a bounded diagnostics session.”_

### Tool surface (MVP)

#### Read-only (viewer)

- `telemetry.queryEvents(filters)`
- `telemetry.listClusters(filters)`
- `telemetry.listPolicies()`
- `telemetry.getMetrics()`
- `diagnostics.getSession(sessionId)`
- `diagnostics.listSessions(filters)`
- `diagnostics.getLogs(sessionId, filters)`
- `diagnostics.getTraces(sessionId, filters)`
- `extraction.sidecarHealth()`

#### Mutating (admin)

- `telemetry.previewPolicy(targeting)`
- `telemetry.createPolicy(input)`
  - requires `expiresAt`
- `telemetry.updatePolicy(id, updates)`
- `telemetry.updateClusterStatus(clusterId, pk, status)`
- `diagnostics.createSession(target, config)`
  - requires `expiresAt` or `expiresInMinutes`
- `diagnostics.updateSession(sessionId, updates)`
- `diagnostics.cancelSession(sessionId)`

#### Compound workflow tool (admin)

- `support.createDebugPack(input)`
  - internally calls:
    - telemetry queries
    - optional diagnostics session create + polling
  - returns a single structured artifact:
    - `debugPackId`, `clusterRefs`, `sessionRefs`, and a markdown summary

### Phase 1 prerequisites (to avoid hidden integration failures)

- Align `@bytelyst/diagnostics-client` ingest endpoint with `platform-service`.
  - Today the service ingests via `POST /api/diagnostics/sessions/:id/logs|traces` (and session-scoped screenshot upload), while the client posts to `POST /api/diagnostics/ingest`.
  - Decision: update the client to use the session-scoped endpoints (no backwards-compat alias endpoint).
- Decide whether `@bytelyst/telemetry-client` should become policy-aware by consuming `GET /api/telemetry/config`.
  - If yes, treat it as a Phase 1 deliverable (with caching + ETag). Otherwise, explicitly defer to Phase 2.

### Output contracts (schemas)

Define explicit JSON schemas (Zod or equivalent) for:

- `TelemetryFilters`
- `TelemetryPolicyInput` (requires expiry)
- `DiagnosticsSessionTarget`
- `DiagnosticsSessionConfig`
- `DebugPackRequest`
- `DebugPackResponse`
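
As an illustration of what an "explicit schema" means here, the sketch below validates a `TelemetryPolicyInput` by hand so it stays dependency-free; with Zod the same rules would be a few lines. The field names (`eventTypes`, `modules`) and the rule that both must be non-empty are assumptions, since the real shape lives in `platform-service`.

```typescript
// Hypothetical shape; confirm field names against platform-service before use.
interface TelemetryPolicyInput {
  productId: string;
  eventTypes: string[];
  modules: string[];
  expiresAt: string; // ISO-8601 timestamp, always required
}

type ParseResult<T> = { ok: true; value: T } | { ok: false; errors: string[] };

function isNonEmptyStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.length > 0 && v.every((x) => typeof x === "string" && x !== "");
}

// Collects every problem instead of failing on the first, so an agent can fix its input in one pass.
function parseTelemetryPolicyInput(raw: unknown, now: Date): ParseResult<TelemetryPolicyInput> {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, errors: ["input must be an object"] };
  }
  const { productId, eventTypes, modules, expiresAt } = raw as Record<string, unknown>;
  const errors: string[] = [];

  if (typeof productId !== "string" || productId === "") errors.push("productId is required");
  if (!isNonEmptyStringArray(eventTypes)) errors.push("eventTypes must list at least one event type");
  if (!isNonEmptyStringArray(modules)) errors.push("modules must list at least one module");

  const expiresMs = typeof expiresAt === "string" ? Date.parse(expiresAt) : Number.NaN;
  if (Number.isNaN(expiresMs)) errors.push("expiresAt must be an ISO-8601 timestamp");
  else if (expiresMs <= now.getTime()) errors.push("expiresAt must be in the future");

  if (errors.length > 0) return { ok: false, errors };
  return {
    ok: true,
    value: {
      productId: productId as string,
      eventTypes: eventTypes as string[],
      modules: modules as string[],
      expiresAt: expiresAt as string,
    },
  };
}

const referenceTime = new Date("2025-01-01T00:00:00Z");
const validPolicy = parseTelemetryPolicyInput(
  { productId: "acme-notes", eventTypes: ["crash"], modules: ["sync"], expiresAt: "2025-01-02T00:00:00Z" },
  referenceTime,
);
const missingExpiry = parseTelemetryPolicyInput(
  { productId: "acme-notes", eventTypes: ["crash"], modules: ["sync"] },
  referenceTime,
);
```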

### Guardrails

- Query limits:
  - default `limit` and max `limit` enforced at MCP layer.
- Policy guardrails:
  - require `expiresAt`
  - require explicit `eventTypes` / `modules`
  - block wildcard collection unless super_admin
- Diagnostics guardrails:
  - enforce max duration
  - enforce max capture volume per flush

### Acceptance criteria

- A single agent can:
  - pull clusters for a product + time window
  - propose a telemetry policy and preview targeting
  - create an expiring policy
  - start a diagnostics session and retrieve data
  - generate a “Debug Pack” artifact
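
To make the "Debug Pack" artifact concrete, here is one way the final assembly step could look once telemetry and diagnostics data have been fetched. The `ClusterRef` and `SessionRef` fields, the `summaryMarkdown` property name, and the `buildDebugPack` helper are all hypothetical; only `debugPackId`, `clusterRefs`, `sessionRefs`, and the presence of a markdown summary come from this plan.

```typescript
// Hypothetical reference shapes; real ones depend on the telemetry/diagnostics APIs.
interface ClusterRef { clusterId: string; title: string; eventCount: number }
interface SessionRef { sessionId: string; status: string }

interface DebugPackResponse {
  debugPackId: string;
  clusterRefs: ClusterRef[];
  sessionRefs: SessionRef[];
  summaryMarkdown: string;
}

function buildDebugPack(
  debugPackId: string,
  productId: string,
  clusters: ClusterRef[],
  sessions: SessionRef[],
): DebugPackResponse {
  // Largest clusters first, so the summary leads with the most likely culprit.
  const ranked = [...clusters].sort((a, b) => b.eventCount - a.eventCount);
  const lines = [
    `# Debug Pack ${debugPackId}`,
    "",
    `Product: ${productId}`,
    "",
    `## Clusters (${ranked.length})`,
    ...ranked.map((c) => `- ${c.clusterId}: ${c.title} (${c.eventCount} events)`),
    "",
    `## Diagnostics sessions (${sessions.length})`,
    ...sessions.map((s) => `- ${s.sessionId}: ${s.status}`),
  ];
  return {
    debugPackId,
    clusterRefs: ranked,
    sessionRefs: sessions,
    summaryMarkdown: lines.join("\n"),
  };
}

const pack = buildDebugPack(
  "dp-001",
  "acme-notes",
  [
    { clusterId: "c-1", title: "Sync timeout", eventCount: 12 },
    { clusterId: "c-2", title: "Login crash", eventCount: 40 },
  ],
  [{ sessionId: "s-9", status: "active" }],
);
```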

### Phase 1 engineering checklist (definition of done)

- MCP layer enforces:
  - request-id propagation (`x-request-id`)
  - required product scoping (`productId`)
  - default query caps and maximum caps
  - expiry requirements for policies/sessions
- Every mutating tool call produces an audit record, either:
  - by calling existing audit endpoints, or
  - by verifying that the underlying platform-service endpoint already records one
- Tool names and inputs are documented and stable (no breaking renames during Phase 1)
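
For the case where the MCP layer writes the audit record itself, a thin wrapper around each mutating handler is one option. The sketch below is kept synchronous for brevity (real handlers would be async), and `AuditRecord`, `AuditSink`, and `withAudit` are hypothetical names rather than existing platform-service types.

```typescript
interface AuditRecord {
  requestId: string;
  productId: string;
  actor: string;
  tool: string;
  outcome: "success" | "failure";
  error?: string;
}

type AuditSink = (record: AuditRecord) => void;
type CallContext = { requestId: string; productId: string; actor: string };

// Wraps a mutating handler so each call writes one audit record,
// whether the handler returns or throws.
function withAudit<I, O>(tool: string, handler: (input: I) => O, sink: AuditSink) {
  return (ctx: CallContext, input: I): O => {
    let result: O;
    try {
      result = handler(input);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      sink({ ...ctx, tool, outcome: "failure", error: message });
      throw err;
    }
    sink({ ...ctx, tool, outcome: "success" });
    return result;
  };
}

// Usage with an in-memory sink and a stand-in handler.
const auditTrail: AuditRecord[] = [];
const cancelSession = withAudit(
  "diagnostics.cancelSession",
  (sessionId: string) => ({ sessionId, status: "cancelled" }),
  (record) => auditTrail.push(record),
);
const cancelled = cancelSession(
  { requestId: "req-1", productId: "acme-notes", actor: "ops@example.com" },
  "s-9",
);
```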

## Phase 2 — A2A orchestration (1–2 weeks)

### Goal

Turn multi-step support/ops workflows into **repeatable agent playbooks** with explicit handoffs.

### Standard agents

- **DispatcherAgent**
- **TelemetryAnalystAgent**
- **DiagnosticsOrchestratorAgent**
- **OpsExecutorAgent**
- **ReportWriterAgent**

### Handoff artifacts

- `SupportIncidentBrief`
- `TelemetryFindings`
- `DiagnosticsSessionPlan`
- `OpsChangePlan`
- `FinalIncidentReport`

### Acceptance criteria

- “Support Debug Pack” runs end-to-end via:
  - Dispatcher → Telemetry Analyst → Diagnostics Orchestrator → Report Writer
- Every handoff is persisted (even if only in logs initially) with stable IDs.
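
One possible shape for persisted handoffs with stable IDs is sketched below. The `Handoff` envelope, the in-memory `HandoffLog`, and the `incidentId/hN` ID scheme are assumptions for illustration; the storage backend is still an open question in this plan.

```typescript
type ArtifactKind =
  | "SupportIncidentBrief"
  | "TelemetryFindings"
  | "DiagnosticsSessionPlan"
  | "OpsChangePlan"
  | "FinalIncidentReport";

interface Handoff {
  handoffId: string; // stable: derived from the incident and its sequence number
  incidentId: string;
  seq: number;
  from: string;
  to: string;
  kind: ArtifactKind;
  payload: unknown;
}

// In-memory stand-in for whatever store is eventually chosen.
class HandoffLog {
  private readonly entries: Handoff[] = [];

  record(incidentId: string, from: string, to: string, kind: ArtifactKind, payload: unknown): Handoff {
    // Sequence numbers are per incident, so IDs are reproducible from the log.
    const seq = this.entries.filter((e) => e.incidentId === incidentId).length + 1;
    const handoff: Handoff = {
      handoffId: `${incidentId}/h${seq}`,
      incidentId,
      seq,
      from,
      to,
      kind,
      payload,
    };
    this.entries.push(handoff);
    return handoff;
  }

  forIncident(incidentId: string): Handoff[] {
    return this.entries.filter((e) => e.incidentId === incidentId);
  }
}

// First two hops of the "Support Debug Pack" chain.
const handoffLog = new HandoffLog();
const brief = handoffLog.record(
  "inc-42", "DispatcherAgent", "TelemetryAnalystAgent",
  "SupportIncidentBrief", { summary: "sync failures after release" },
);
const findings = handoffLog.record(
  "inc-42", "TelemetryAnalystAgent", "DiagnosticsOrchestratorAgent",
  "TelemetryFindings", { clusterIds: ["c-1"] },
);
```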

## Phase 3 — Extraction task iteration loop (1–3 weeks)

### Goal

Make extraction prompt/task improvements safe, testable, and regression-resistant.

### MCP tools

- `extraction.extract(text, taskId?, modelId?)`
- `extraction.extractBatch(inputs)`
- `extraction.submitJob(inputs, webhookUrl?)`
- `extraction.getJob(jobId)`
- `extraction.metrics()` / `extraction.cacheStats()`

### A2A workflow

- **TaskDesignerAgent** drafts task prompt + examples
- **EvalRunnerAgent** runs batch eval sets
- **RegressionAgent** compares to baseline
- **PublisherAgent** updates task registry + rollout

### Acceptance criteria

- A single command/workflow can:
  - run eval suite
  - compute simple quality metrics (schema validity, required fields coverage)
  - produce a report and recommended next edit
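
The two "simple quality metrics" named above could be computed roughly as follows. This is a sketch, not the eval suite itself: `scoreOutputs`, `findRegressions`, the injected `isSchemaValid` check, and the 0.01 tolerance are all hypothetical, and coverage is defined here as present required fields over all (output, field) pairs.

```typescript
interface EvalMetrics {
  schemaValidity: number;        // fraction of outputs that pass the schema check
  requiredFieldCoverage: number; // fraction of (output, required field) pairs that are filled
}

function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

function isPresent(v: unknown): boolean {
  return v !== undefined && v !== null && v !== "";
}

function scoreOutputs(
  outputs: unknown[],
  requiredFields: string[],
  isSchemaValid: (output: unknown) => boolean, // stand-in for a real schema validator
): EvalMetrics {
  let valid = 0;
  let present = 0;
  for (const output of outputs) {
    if (isSchemaValid(output)) valid += 1;
    if (typeof output === "object" && output !== null) {
      const record = output as Record<string, unknown>;
      for (const field of requiredFields) {
        if (isPresent(record[field])) present += 1;
      }
    }
  }
  return {
    schemaValidity: ratio(valid, outputs.length),
    requiredFieldCoverage: ratio(present, outputs.length * requiredFields.length),
  };
}

// Names the metrics that dropped below the baseline by more than the tolerance.
function findRegressions(current: EvalMetrics, baseline: EvalMetrics, tolerance = 0.01): string[] {
  const regressed: string[] = [];
  if (current.schemaValidity < baseline.schemaValidity - tolerance) regressed.push("schemaValidity");
  if (current.requiredFieldCoverage < baseline.requiredFieldCoverage - tolerance) {
    regressed.push("requiredFieldCoverage");
  }
  return regressed;
}

// Three extraction outputs: one complete, one missing a field, one malformed.
const evalOutputs: unknown[] = [{ name: "a", date: "2024-01-01" }, { name: "b" }, "garbage"];
const hasStringName = (o: unknown): boolean =>
  typeof o === "object" && o !== null && typeof (o as Record<string, unknown>).name === "string";
const metrics = scoreOutputs(evalOutputs, ["name", "date"], hasStringName);
const regressed = findRegressions(metrics, { schemaValidity: 1, requiredFieldCoverage: 0.9 });
```

In this example two of three outputs are schema-valid and three of six required-field slots are filled, so both metrics fall below the sample baseline.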

## Phase 4 — Ops expansion (jobs/flags/maintenance/webhooks) (1–2 weeks)

### Tools

- `jobs.list`, `jobs.trigger`, `jobs.listRuns`
- `flags.list`, `flags.upsert`, `flags.evaluate`
- `maintenance.get`, `maintenance.set`
- `webhooks.listSubscriptions`, `webhooks.test`, `webhooks.rotateSecret`

### Acceptance criteria

- “Ops Copilot” can safely execute a bounded change plan:
  - propose change
  - dry-run if supported
  - execute with audit
  - verify outcome
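
The four-step loop above could be expressed as a small runner that records what happened at each step. This is a synchronous sketch with injected callbacks; `ChangePlan`, `StepResult`, and `runChangePlan` are hypothetical names, and a real runner would call the MCP tools asynchronously.

```typescript
type StepName = "propose" | "dryRun" | "execute" | "verify";

interface StepResult { step: StepName; ok: boolean; detail: string }

interface ChangePlan {
  description: string;
  dryRun?: () => boolean; // omitted when the underlying tool has no dry-run mode
  execute: () => void;
  verify: () => boolean;
}

function runChangePlan(plan: ChangePlan, audit: (entry: string) => void): StepResult[] {
  const trace: StepResult[] = [{ step: "propose", ok: true, detail: plan.description }];

  if (plan.dryRun) {
    const passed = plan.dryRun();
    trace.push({ step: "dryRun", ok: passed, detail: passed ? "dry run passed" : "dry run failed, aborting" });
    if (!passed) return trace; // never execute a change whose dry run failed
  }

  plan.execute();
  audit(`executed: ${plan.description}`);
  trace.push({ step: "execute", ok: true, detail: "change applied and audited" });

  const verified = plan.verify();
  trace.push({ step: "verify", ok: verified, detail: verified ? "outcome verified" : "verification failed" });
  return trace;
}

// Example: enabling a feature flag held in a local stand-in store.
const flagStore = new Map<string, boolean>([["new-sync", false]]);
const opsAudit: string[] = [];
const flagTrace = runChangePlan(
  {
    description: "enable flag new-sync",
    dryRun: () => flagStore.has("new-sync"),
    execute: () => { flagStore.set("new-sync", true); },
    verify: () => flagStore.get("new-sync") === true,
  },
  (entry) => opsAudit.push(entry),
);
```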

## Security & privacy checklist

- Explicit `productId` on every tool call
- Avoid returning raw PII in tool results
- Ensure diagnostics redaction remains enforced server-side
- Enforce expirations on policies and sessions
- Rate limit MCP server endpoints
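
For the rate-limiting item, a token bucket is one common technique. The sketch below keys buckets by `productId`, which is an assumption (per-caller or per-tool keys are equally plausible), and takes the clock as a parameter so behavior is deterministic; `TokenBucket` and `ProductRateLimiter` are hypothetical names.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so short bursts are allowed
    this.lastRefillMs = nowMs;
  }

  tryTake(nowMs: number): boolean {
    // Refill in proportion to elapsed time, never beyond capacity.
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// One bucket per product, so a noisy product cannot starve the others.
class ProductRateLimiter {
  private readonly buckets = new Map<string, TokenBucket>();
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
  }

  allow(productId: string, nowMs: number): boolean {
    let bucket = this.buckets.get(productId);
    if (!bucket) {
      bucket = new TokenBucket(this.capacity, this.refillPerSecond, nowMs);
      this.buckets.set(productId, bucket);
    }
    return bucket.tryTake(nowMs);
  }
}

// Burst of 2 calls, then 1 call per second (placeholder numbers).
const limiter = new ProductRateLimiter(2, 1);
const burst = [limiter.allow("acme-notes", 0), limiter.allow("acme-notes", 0), limiter.allow("acme-notes", 0)];
```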

## Rollout strategy

- Start with internal-only usage (super_admin).
- Add admin roles once guardrails are proven.
- Add viewer read-only access for broader teams.
- Add product-specific namespaces only after platform namespaces stabilize.

## Work breakdown (suggested)

- **Milestone A**: MVP MCP server + telemetry/diagnostics read-only
- **Milestone B**: mutating tools + dry-run/expiry enforcement
- **Milestone C**: `support.createDebugPack` compound tool
- **Milestone D**: A2A runner + handoff schemas
- **Milestone E**: extraction eval loop

## Open questions (need decisions)

- Should the MCP server call services via:
  - service REST endpoints only, or
  - direct Cosmos reads for some query paths?
- Where should A2A handoff artifacts be stored:
  - telemetry events,
  - a dedicated Cosmos container,
  - or both?
- Should the MCP server be its own repo/package, or be colocated under `platform-service`?