MCP + A2A — Implementation Plan (Execution-Ready)
Objective
Deliver a safe, auditable, product-aware MCP + A2A capability layer on top of existing ByteLyst services (primarily platform-service and extraction-service) so that agents can:
- diagnose incidents (telemetry + diagnostics)
- manage telemetry policies / rollouts
- orchestrate remote diagnostics sessions
- iterate on extraction tasks/prompts with eval loops
- run repeatable ops workflows (jobs, flags, maintenance)
This plan intentionally starts with a minimal P0 slice and expands in phases.
Guiding constraints (must-haves)
- Product isolation
  - Every tool call is scoped to an explicit productId.
- Auditability
  - Every mutating tool call must produce an audit record (directly or via existing APIs).
- Least privilege
  - Query tools available to viewer roles; mutating tools gated to admin/super_admin.
- Safety defaults
  - Mutations support dryRun where feasible.
  - Diagnostic amplification (policies/sessions) must require an expiresAt.
- No new “shadow APIs”
  - Prefer calling existing service endpoints; only add endpoints if the tool surface cannot be expressed otherwise.
Phase 0 — Baseline readiness (1–2 days)
Deliverables
- A “tool surface” inventory mapped to existing REST endpoints and required headers.
- A role matrix for tool authorization.
- A request-id propagation and logging standard for MCP tool calls.
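One possible shape for the role matrix is a lookup from tool name to minimum role, checked before any downstream call. The sketch below is illustrative only: the role names and tool names come from this plan, while the rank ordering, the Map layout, and the deny-by-default rule for unlisted tools are assumptions.

```typescript
// Hypothetical role matrix sketch. Role names and tool names are from the plan;
// the ranking and the deny-by-default behaviour are assumptions.
type Role = "viewer" | "admin" | "super_admin";

const ROLE_RANK: Record<Role, number> = { viewer: 0, admin: 1, super_admin: 2 };

// Minimum role required per tool (a small excerpt, not the full tool surface).
const TOOL_MIN_ROLE = new Map<string, Role>([
  ["telemetry.queryEvents", "viewer"],
  ["telemetry.listPolicies", "viewer"],
  ["telemetry.createPolicy", "admin"],
  ["diagnostics.createSession", "admin"],
]);

// Returns true only when the tool is listed and the caller's role ranks
// at or above the tool's minimum role.
function isAuthorized(role: Role, tool: string): boolean {
  const required = TOOL_MIN_ROLE.get(tool);
  if (required === undefined) {
    return false; // unlisted tools are denied for every role
  }
  return ROLE_RANK[role] >= ROLE_RANK[required];
}
```

Keeping the matrix as data rather than scattered conditionals makes it easy to review alongside the tool surface inventory.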
Phase 0 checklist (definition of done)
- Confirm whether MCP tool implementations will call REST only (preferred) or allow direct Cosmos reads for selected query paths.
- Choose whether MCP is:
  - a new service/package in learning_ai_common_plat, or
  - colocated under services/platform-service.
- Define the initial auth strategy:
  - JWT only, or
  - JWT + API tokens for automation.
- Define tool-level authorization rules (viewer/admin/super_admin) and how they’re enforced.
- Agree on a consistent product scoping rule:
  - productId is mandatory input for every tool call and sent as x-product-id downstream.
Decisions to lock
- MCP server shape
  - Single server with namespaces (platform.*, extraction.*) vs. two servers.
- Auth mechanism
  - Primary: platform-service JWT.
  - Secondary: platform API tokens for trusted automation.
Required invariants
- x-request-id propagated on all downstream calls.
- x-product-id required and validated for all calls.
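The two invariants can be enforced in one place, where the MCP layer builds headers for a downstream call. This is a minimal sketch: the header names are from this plan, but ToolCallContext, buildDownstreamHeaders, and the injected ID generator are hypothetical names chosen for illustration.

```typescript
// Hypothetical helper; only the header names come from the plan.
interface ToolCallContext {
  productId: string;             // mandatory input on every tool call
  requestId: string | undefined; // inbound request id, if the caller supplied one
}

// Builds the headers sent downstream. Throws before any network I/O
// when productId is missing or blank.
function buildDownstreamHeaders(
  ctx: ToolCallContext,
  newRequestId: () => string, // injected so the ID scheme stays the caller's choice
): Record<string, string> {
  const productId = ctx.productId.trim();
  if (productId.length === 0) {
    throw new Error("productId is required for every tool call");
  }
  // Reuse the inbound request id when present so one id spans the whole call chain.
  const requestId = ctx.requestId ? ctx.requestId : newRequestId();
  return {
    "x-product-id": productId,
    "x-request-id": requestId,
  };
}
```

Failing fast here means a tool call without a product scope never reaches platform-service or extraction-service.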
Phase 1 — MVP MCP server (P0 slice) (3–7 days)
Goal
Enable a Support/Ops agent to answer: “What’s happening?” and “Start a bounded diagnostics session.”
Tool surface (MVP)
Read-only (viewer)
- telemetry.queryEvents(filters)
- telemetry.listClusters(filters)
- telemetry.listPolicies()
- telemetry.getMetrics()
- diagnostics.getSession(sessionId)
- diagnostics.listSessions(filters)
- diagnostics.getLogs(sessionId, filters)
- diagnostics.getTraces(sessionId, filters)
- extraction.sidecarHealth()
Phase 1 prerequisites (to avoid hidden integration failures)
- Align @bytelyst/diagnostics-client ingest endpoint with platform-service.
  - Today the service ingests via POST /api/diagnostics/sessions/:id/logs|traces (and session-scoped screenshot upload), while the client posts to POST /api/diagnostics/ingest.
  - Decision: update the client to use the session-scoped endpoints (no backwards-compat alias endpoint).
- Decide whether @bytelyst/telemetry-client should become policy-aware by consuming GET /api/telemetry/config.
  - If yes, treat it as a Phase 1 deliverable (with caching + ETag). Otherwise, explicitly defer to Phase 2.
Mutating (admin)
- telemetry.previewPolicy(targeting)
- telemetry.createPolicy(input)
  - requires expiresAt
- telemetry.updatePolicy(id, updates)
- telemetry.updateClusterStatus(clusterId, pk, status)
- diagnostics.createSession(target, config)
  - requires expiresAt or expiresInMinutes
- diagnostics.updateSession(sessionId, updates)
- diagnostics.cancelSession(sessionId)
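The expiry requirement on diagnostics.createSession could be normalized into a single absolute timestamp before the request goes downstream. The following is a sketch under assumptions: the field names expiresAt and expiresInMinutes are from this plan, while resolveExpiry, the ISO-8601 string format, and the maxMinutes cap are hypothetical.

```typescript
// Hypothetical expiry normalizer. Field names are from the plan;
// the cap and the ISO-8601 input format are assumptions.
interface ExpiryInput {
  expiresAt?: string;        // absolute time, assumed to be an ISO-8601 string
  expiresInMinutes?: number; // relative alternative
}

// Returns the absolute expiry, or throws when it is missing, invalid,
// already past, or further out than maxMinutes from now.
function resolveExpiry(input: ExpiryInput, now: Date, maxMinutes: number): Date {
  let expiry: Date;
  if (input.expiresAt !== undefined) {
    expiry = new Date(input.expiresAt);
  } else if (input.expiresInMinutes !== undefined) {
    expiry = new Date(now.getTime() + input.expiresInMinutes * 60000);
  } else {
    throw new Error("expiresAt or expiresInMinutes is required");
  }
  const ms = expiry.getTime();
  if (isNaN(ms)) {
    throw new Error("expiry is not a valid timestamp");
  }
  if (ms <= now.getTime()) {
    throw new Error("expiry must be in the future");
  }
  if (ms > now.getTime() + maxMinutes * 60000) {
    throw new Error("expiry exceeds the maximum allowed duration");
  }
  return expiry;
}
```

Passing now as a parameter keeps the function deterministic, which makes the boundary cases straightforward to test.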
Compound workflow tool (admin)
- support.createDebugPack(input)
  - internally calls:
    - telemetry queries
    - optional diagnostics session create + polling
  - returns a single structured artifact:
    - debugPackId, clusterRefs, sessionRefs, and a markdown summary
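The final assembly step of the compound tool might look like the sketch below, which takes already-fetched cluster and session references and produces the single artifact. The top-level fields debugPackId, clusterRefs, and sessionRefs are from this plan; the inner fields (eventCount, status), the summaryMarkdown name, and the summary layout are assumptions, and the telemetry queries and session polling are out of scope here.

```typescript
// Hypothetical artifact assembly. Inner fields and the markdown layout are assumptions.
interface ClusterRef {
  clusterId: string;
  eventCount: number; // assumed field
}

interface SessionRef {
  sessionId: string;
  status: string; // assumed field
}

interface DebugPack {
  debugPackId: string;
  clusterRefs: ClusterRef[];
  sessionRefs: SessionRef[];
  summaryMarkdown: string;
}

// Combines previously collected references into one structured artifact.
function assembleDebugPack(
  debugPackId: string,
  productId: string,
  clusters: ClusterRef[],
  sessions: SessionRef[],
): DebugPack {
  const clusterLines = clusters.length > 0
    ? clusters.map((c) => "- " + c.clusterId + ": " + c.eventCount + " events")
    : ["- none"];
  const sessionLines = sessions.length > 0
    ? sessions.map((s) => "- " + s.sessionId + ": " + s.status)
    : ["- none"];
  const lines = ["# Debug Pack " + debugPackId, "", "Product: " + productId, "", "## Clusters"]
    .concat(clusterLines, ["", "## Diagnostics sessions"], sessionLines);
  return {
    debugPackId: debugPackId,
    clusterRefs: clusters,
    sessionRefs: sessions,
    summaryMarkdown: lines.join("\n"),
  };
}
```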
Output contracts (schemas)
Define explicit JSON schemas (Zod or equivalent) for:
- TelemetryFilters
- TelemetryPolicyInput (requires expiry)
- DiagnosticsSessionTarget
- DiagnosticsSessionConfig
- DebugPackRequest
- DebugPackResponse
Guardrails
- Query limits:
  - default limit and max limit enforced at MCP layer.
- Policy guardrails:
  - require expiresAt
  - require explicit eventTypes/modules
  - block wildcard collection unless super_admin
- Diagnostics guardrails:
  - enforce max duration
  - enforce max capture volume per flush
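The query-limit and policy guardrails above could be expressed as two small pure checks at the MCP layer. This sketch makes several assumptions that the plan does not fix: the numeric caps, the "*" wildcard representation, the PolicyDraft shape, and the reading of "explicit eventTypes/modules" as "at least one of the two lists is non-empty".

```typescript
// Hypothetical guardrail checks; the caps, wildcard form, and draft shape are assumptions.
const DEFAULT_LIMIT = 50;
const MAX_LIMIT = 500;

// Applies the default when no usable limit is given and caps oversized requests.
function clampLimit(requested: number | undefined): number {
  if (requested === undefined || !isFinite(requested) || requested < 1) {
    return DEFAULT_LIMIT;
  }
  return Math.min(Math.floor(requested), MAX_LIMIT);
}

interface PolicyDraft {
  expiresAt?: string;
  eventTypes: string[];
  modules: string[];
}

// Returns a list of guardrail violations; an empty list means the draft may proceed.
function policyViolations(draft: PolicyDraft, role: "admin" | "super_admin"): string[] {
  const problems: string[] = [];
  if (!draft.expiresAt) {
    problems.push("expiresAt is required");
  }
  if (draft.eventTypes.length === 0 && draft.modules.length === 0) {
    problems.push("explicit eventTypes or modules are required");
  }
  const hasWildcard = draft.eventTypes.concat(draft.modules).some((v) => v === "*");
  if (hasWildcard && role !== "super_admin") {
    problems.push("wildcard collection requires super_admin");
  }
  return problems;
}
```

Returning every violation at once, rather than throwing on the first, gives an agent enough information to repair a draft in a single retry.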
Acceptance criteria
- A single agent can:
  - pull clusters for a product + time window
  - propose a telemetry policy and preview targeting
  - create an expiring policy
  - start a diagnostics session and retrieve data
  - generate a “Debug Pack” artifact
Phase 1 engineering checklist (definition of done)
- MCP layer enforces:
  - request-id propagation (x-request-id)
  - required product scoping (productId)
  - default query caps and maximum caps
  - expiry requirements for policies/sessions
- Every mutating tool call produces an audit record, either:
  - by calling existing audit endpoints, or
  - by ensuring the underlying platform-service endpoint already records audit
- Tool names and inputs are documented and stable (no breaking renames during Phase 1)
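For tools whose underlying endpoint does not already record audit, the MCP layer could wrap the mutation so that exactly one audit record is written whether the call succeeds or fails. The wrapper below is a synchronous sketch to keep it short; a real tool handler would likely be async and await both steps. AuditRecord, its fields, and withAudit are hypothetical, not an existing ByteLyst API.

```typescript
// Hypothetical audit wrapper; the record shape is an assumption.
interface AuditRecord {
  tool: string;
  productId: string;
  requestId: string;
  actor: string;
  dryRun: boolean;
  outcome: "success" | "failure";
}

// Runs the mutation and writes one audit record with the observed outcome.
// Errors from the mutation still propagate to the caller after the record is written.
function withAudit<T>(
  meta: Omit<AuditRecord, "outcome">,
  writeAudit: (record: AuditRecord) => void,
  run: () => T,
): T {
  let outcome: "success" | "failure" = "failure";
  try {
    const result = run();
    outcome = "success";
    return result;
  } finally {
    writeAudit({ ...meta, outcome });
  }
}
```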
Phase 2 — A2A orchestration (1–2 weeks)
Goal
Turn multi-step support/ops workflows into repeatable agent playbooks with explicit handoffs.
Standard agents
- DispatcherAgent
- TelemetryAnalystAgent
- DiagnosticsOrchestratorAgent
- OpsExecutorAgent
- ReportWriterAgent
Handoff artifacts
- SupportIncidentBrief
- TelemetryFindings
- DiagnosticsSessionPlan
- OpsChangePlan
- FinalIncidentReport
Acceptance criteria
- “Support Debug Pack” runs end-to-end via:
  - Dispatcher → Telemetry Analyst → Diagnostics Orchestrator → Report Writer
- Every handoff is persisted (even if only in logs initially) with stable IDs.
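One way to get stable handoff IDs is to derive them from the incident ID and the step position, so that a retried run produces the same identifiers. The sketch below plans the three handoffs of the Support Debug Pack route. The agent and artifact names are from this plan, but the pairing of each artifact with a specific handoff, the ID format, and the HandoffRecord shape are all assumptions.

```typescript
// Hypothetical handoff planning; the artifact-to-step pairing and ID format are assumptions.
type AgentName =
  | "DispatcherAgent"
  | "TelemetryAnalystAgent"
  | "DiagnosticsOrchestratorAgent"
  | "ReportWriterAgent";

type ArtifactType = "SupportIncidentBrief" | "TelemetryFindings" | "DiagnosticsSessionPlan";

interface HandoffRecord {
  handoffId: string;
  incidentId: string;
  from: AgentName;
  to: AgentName;
  artifact: ArtifactType;
}

const DEBUG_PACK_STEPS: { from: AgentName; to: AgentName; artifact: ArtifactType }[] = [
  { from: "DispatcherAgent", to: "TelemetryAnalystAgent", artifact: "SupportIncidentBrief" },
  { from: "TelemetryAnalystAgent", to: "DiagnosticsOrchestratorAgent", artifact: "TelemetryFindings" },
  { from: "DiagnosticsOrchestratorAgent", to: "ReportWriterAgent", artifact: "DiagnosticsSessionPlan" },
];

// Same incidentId always yields the same handoffIds, so re-running the
// workflow does not create duplicate records under new identifiers.
function planHandoffs(incidentId: string): HandoffRecord[] {
  return DEBUG_PACK_STEPS.map((step, index) => ({
    handoffId: incidentId + "/handoff-" + (index + 1),
    incidentId: incidentId,
    from: step.from,
    to: step.to,
    artifact: step.artifact,
  }));
}
```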
Phase 3 — Extraction task iteration loop (1–3 weeks)
Goal
Make extraction prompt/task improvements safe, testable, and regression-resistant.
MCP tools
- extraction.extract(text, taskId?, modelId?)
- extraction.extractBatch(inputs)
- extraction.submitJob(inputs, webhookUrl?)
- extraction.getJob(jobId)
- extraction.metrics() / extraction.cacheStats()
A2A workflow
- TaskDesignerAgent drafts task prompt + examples
- EvalRunnerAgent runs batch eval sets
- RegressionAgent compares to baseline
- PublisherAgent updates task registry + rollout
Acceptance criteria
- A single command/workflow can:
  - run eval suite
  - compute simple quality metrics (schema validity, required fields coverage)
  - produce a report and recommended next edit
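The two quality metrics named above could be computed as simple ratios over an eval batch. The definitions in this sketch are assumptions, since the plan does not define them: schema validity is the share of outputs that passed validation (failed outputs are represented as null), and required fields coverage is the share of required-field slots that hold a non-empty value, with failed outputs counting as all-missing.

```typescript
// Hypothetical metric definitions; the plan names the metrics but does not define them.
type EvalOutput = Record<string, unknown> | null; // null = output failed schema validation

interface QualityMetrics {
  schemaValidity: number;        // valid outputs / all outputs
  requiredFieldCoverage: number; // filled required slots / all required slots
}

function computeQuality(outputs: EvalOutput[], requiredFields: string[]): QualityMetrics {
  if (outputs.length === 0) {
    return { schemaValidity: 0, requiredFieldCoverage: 0 };
  }
  let valid = 0;
  let filled = 0;
  for (const output of outputs) {
    if (output === null) {
      continue; // invalid output: contributes no valid count and no filled slots
    }
    valid += 1;
    for (const field of requiredFields) {
      if (!Object.prototype.hasOwnProperty.call(output, field)) {
        continue;
      }
      const value = output[field];
      if (value !== undefined && value !== null && value !== "") {
        filled += 1;
      }
    }
  }
  const slots = outputs.length * requiredFields.length;
  return {
    schemaValidity: valid / outputs.length,
    requiredFieldCoverage: slots === 0 ? 1 : filled / slots,
  };
}
```

A regression check can then compare these two numbers against the stored baseline for the same eval set.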
Phase 4 — Ops expansion (jobs/flags/maintenance/webhooks) (1–2 weeks)
Tools
- jobs.list, jobs.trigger, jobs.listRuns
- flags.list, flags.upsert, flags.evaluate
- maintenance.get, maintenance.set
- webhooks.listSubscriptions, webhooks.test, webhooks.rotateSecret
Acceptance criteria
- “Ops Copilot” can safely execute a bounded change plan:
  - propose change
  - dry-run if supported
  - execute with audit
  - verify outcome
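The propose, dry-run, execute, verify sequence could be driven by a small runner that stops at the first step that fails its dry-run or its verification. This is a synchronous sketch for brevity; ChangeStep, StepResult, and runChangePlan are hypothetical names, and the stop-on-first-failure policy is an assumption rather than something this plan specifies.

```typescript
// Hypothetical change-plan runner; stop-on-first-failure is an assumption.
interface ChangeStep {
  description: string;
  dryRun?: () => string[]; // problems found; omitted when the tool has no dry-run mode
  execute: () => void;
  verify: () => boolean;
}

interface StepResult {
  description: string;
  status: "executed" | "blocked_by_dry_run" | "verification_failed";
  details: string[];
}

function runChangePlan(steps: ChangeStep[], audit: (entry: string) => void): StepResult[] {
  const results: StepResult[] = [];
  for (const step of steps) {
    const problems = step.dryRun ? step.dryRun() : [];
    if (problems.length > 0) {
      results.push({ description: step.description, status: "blocked_by_dry_run", details: problems });
      break; // nothing was changed for this step; later steps are not attempted
    }
    step.execute();
    audit("executed: " + step.description);
    if (!step.verify()) {
      results.push({ description: step.description, status: "verification_failed", details: [] });
      break;
    }
    results.push({ description: step.description, status: "executed", details: [] });
  }
  return results;
}
```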
Security & privacy checklist
- Explicit productId on every tool call
- Avoid returning raw PII in tool results
- Ensure diagnostics redaction remains enforced server-side
- Enforce expirations on policies and sessions
- Rate limit MCP server endpoints
Rollout strategy
- Start with internal-only usage (super_admin).
- Add admin roles once guardrails are proven.
- Add viewer read-only access for broader teams.
- Add product-specific namespaces only after platform namespaces stabilize.
Work breakdown (suggested)
- Milestone A: MVP MCP server + telemetry/diagnostics read-only
- Milestone B: mutating tools + dry-run/expiry enforcement
- Milestone C: support.createDebugPack compound tool
- Milestone D: A2A runner + handoff schemas
- Milestone E: extraction eval loop
Open questions (need decisions)
- Should the MCP server call services via:
  - service REST endpoints only, or
  - direct Cosmos reads for some query paths?
- Where should A2A handoff artifacts be stored:
  - telemetry events,
  - a dedicated Cosmos container,
  - or both?
- Do we want one MCP server repo/package, or colocated under platform-service?