learning_ai_common_plat/docs/MCP+A2A/IMPLEMENTATION_PLAN.md

MCP + A2A — Implementation Plan (Execution-Ready)

Objective

Deliver a safe, auditable, product-aware MCP + A2A capability layer on top of existing ByteLyst services (primarily platform-service and extraction-service) so that agents can:

  • diagnose incidents (telemetry + diagnostics)
  • manage telemetry policies / rollouts
  • orchestrate remote diagnostics sessions
  • iterate on extraction tasks/prompts with eval loops
  • run repeatable ops workflows (jobs, flags, maintenance)

This plan intentionally starts with a minimal P0 slice and expands in phases.

Guiding constraints (must-haves)

  • Product isolation
    • Every tool call is scoped to an explicit productId.
  • Auditability
    • Every mutating tool call must produce an audit record (directly or via existing APIs).
  • Least privilege
    • Query tools available to viewer roles; mutating tools gated to admin/super_admin.
  • Safety defaults
    • Mutations support dryRun where feasible.
    • Diagnostic amplification (policies/sessions) must require an expiresAt.
  • No new “shadow APIs”
    • Prefer calling existing service endpoints; only add endpoints if the tool surface cannot be expressed otherwise.
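
The product-isolation and least-privilege constraints above could be enforced by a single check in front of every tool. The following is a minimal sketch, not the agreed design: only the three role names and productId come from this plan, while ToolKind, ToolCall, Decision, and authorizeToolCall are hypothetical names used for illustration.

```typescript
// Hypothetical role model; only the role names are taken from the plan.
type Role = "viewer" | "admin" | "super_admin";
type ToolKind = "query" | "mutating";

const ROLE_RANK: Record<Role, number> = { viewer: 0, admin: 1, super_admin: 2 };

// Minimum role per tool kind: queries for viewers, mutations for admin and above.
const MIN_ROLE: Record<ToolKind, Role> = { query: "viewer", mutating: "admin" };

interface ToolCall {
  tool: string;
  kind: ToolKind;
  productId?: string;
}

type Decision = { allowed: true } | { allowed: false; reason: string };

function authorizeToolCall(role: Role, call: ToolCall): Decision {
  // Product isolation: reject any call that lacks an explicit productId.
  if (call.productId === undefined || call.productId.trim() === "") {
    return { allowed: false, reason: "productId is required" };
  }
  // Least privilege: compare the caller's rank with the minimum for this tool kind.
  const required = MIN_ROLE[call.kind];
  if (ROLE_RANK[role] < ROLE_RANK[required]) {
    return { allowed: false, reason: `${call.kind} tools require ${required}` };
  }
  return { allowed: true };
}
```

Keeping the check independent of any individual tool means the role matrix from Phase 0 can live in one table rather than being repeated per handler.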

Phase 0 — Baseline readiness (1–2 days)

Deliverables

  • A “tool surface” inventory mapped to existing REST endpoints and required headers.
  • A role matrix for tool authorization.
  • A request-id propagation and logging standard for MCP tool calls.

Phase 0 checklist (definition of done)

  • Confirm whether MCP tool implementations will call REST only (preferred) or allow direct Cosmos reads for selected query paths.
  • Choose whether MCP is:
    • a new service/package in learning_ai_common_plat, or
    • colocated under services/platform-service.
  • Define the initial auth strategy:
    • JWT only, or
    • JWT + API tokens for automation.
  • Define tool-level authorization rules (viewer/admin/super_admin) and how they're enforced.
  • Agree on a consistent product scoping rule:
    • productId is mandatory input for every tool call and sent as x-product-id downstream.

Decisions to lock

  • MCP server shape
    • Single server with namespaces (platform.*, extraction.*) vs. two servers.
  • Auth mechanism
    • Primary: platform-service JWT.
    • Secondary: platform API tokens for trusted automation.

Required invariants

  • x-request-id propagated on all downstream calls.
  • x-product-id required and validated for all calls.
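
One possible way to hold both invariants is to build downstream headers in a single helper that every tool uses. This is a sketch under assumptions: the header names come from this plan, but ToolContext, buildDownstreamHeaders, and the injected id generator are hypothetical.

```typescript
// Hypothetical helper; only the two header names are taken from the plan.
interface ToolContext {
  productId: string;
  requestId?: string; // inbound request id, if the caller supplied one
}

function buildDownstreamHeaders(
  ctx: ToolContext,
  newRequestId: () => string, // injected so the id source stays swappable
): Record<string, string> {
  const productId = ctx.productId.trim();
  if (productId === "") {
    throw new Error("x-product-id is required for every downstream call");
  }
  // Reuse the inbound id so one agent action can be traced across services.
  const requestId =
    ctx.requestId !== undefined && ctx.requestId.length > 0 ? ctx.requestId : newRequestId();
  return { "x-request-id": requestId, "x-product-id": productId };
}
```

Throwing on an empty productId, rather than defaulting it, keeps a misconfigured tool from silently querying the wrong product.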

Phase 1 — MVP MCP server (P0 slice) (3–7 days)

Goal

Enable a Support/Ops agent to answer: “What's happening?” and “Start a bounded diagnostics session.”

Tool surface (MVP)

Read-only (viewer)

  • telemetry.queryEvents(filters)
  • telemetry.listClusters(filters)
  • telemetry.listPolicies()
  • telemetry.getMetrics()
  • diagnostics.getSession(sessionId)
  • diagnostics.listSessions(filters)
  • diagnostics.getLogs(sessionId, filters)
  • diagnostics.getTraces(sessionId, filters)
  • extraction.sidecarHealth()

Phase 1 prerequisites (to avoid hidden integration failures)

  • Align @bytelyst/diagnostics-client ingest endpoint with platform-service.
    • Today the service ingests via POST /api/diagnostics/sessions/:id/logs|traces (plus a session-scoped screenshot upload), while the client sends POST /api/diagnostics/ingest.
    • Decision: update the client to use the session-scoped endpoints (no backwards-compat alias endpoint).
  • Decide whether @bytelyst/telemetry-client should become policy-aware by consuming GET /api/telemetry/config.
    • If yes, treat it as a Phase 1 deliverable (with caching + ETag). Otherwise, explicitly defer to Phase 2.
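
If the telemetry client does become policy-aware, the "caching + ETag" part might look roughly like the sketch below. It uses the standard HTTP conditional request headers (ETag, If-None-Match, 304 Not Modified) against GET /api/telemetry/config. Everything else is an assumption: TelemetryConfigCache, ConfigTransport, and ConfigResponse are hypothetical, and the transport is synchronous and injected purely to keep the example short and testable without a network.

```typescript
// Hypothetical transport shape; a real client would wrap an async HTTP call.
interface ConfigResponse {
  status: number;
  etag?: string;
  body?: string;
}
type ConfigTransport = (path: string, headers: Record<string, string>) => ConfigResponse;

class TelemetryConfigCache {
  private readonly transport: ConfigTransport;
  private etag: string | undefined;
  private body: string | undefined;

  constructor(transport: ConfigTransport) {
    this.transport = transport;
  }

  get(): string {
    const headers: Record<string, string> = {};
    // Send the last seen ETag so the server can answer 304 when nothing changed.
    if (this.etag !== undefined) headers["If-None-Match"] = this.etag;
    const res = this.transport("/api/telemetry/config", headers);
    if (res.status === 304 && this.body !== undefined) return this.body;
    if (res.status === 200 && res.body !== undefined) {
      this.etag = res.etag;
      this.body = res.body;
      return res.body;
    }
    throw new Error(`unexpected config response: ${res.status}`);
  }
}
```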

Mutating (admin)

  • telemetry.previewPolicy(targeting)
  • telemetry.createPolicy(input)
    • requires expiresAt
  • telemetry.updatePolicy(id, updates)
  • telemetry.updateClusterStatus(clusterId, pk, status)
  • diagnostics.createSession(target, config)
    • requires expiresAt or expiresInMinutes
  • diagnostics.updateSession(sessionId, updates)
  • diagnostics.cancelSession(sessionId)
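
Because diagnostics.createSession accepts either expiresAt or expiresInMinutes, the MCP layer needs to normalise both into one absolute expiry before calling downstream. A minimal sketch, assuming a hypothetical resolveExpiry helper and an illustrative maxMinutes cap (this plan does not specify a value):

```typescript
interface ExpiryInput {
  expiresAt?: string; // ISO 8601 timestamp
  expiresInMinutes?: number;
}

// maxMinutes is an assumed cap; the plan only says a maximum duration is enforced.
function resolveExpiry(input: ExpiryInput, now: Date, maxMinutes: number): Date {
  let expiry: Date;
  if (input.expiresAt !== undefined) {
    expiry = new Date(input.expiresAt);
  } else if (input.expiresInMinutes !== undefined) {
    expiry = new Date(now.getTime() + input.expiresInMinutes * 60_000);
  } else {
    throw new Error("expiresAt or expiresInMinutes is required");
  }
  // Covers unparseable timestamps and non-finite minute values alike.
  if (Number.isNaN(expiry.getTime())) {
    throw new Error("expiry is not a valid point in time");
  }
  const remainingMs = expiry.getTime() - now.getTime();
  if (remainingMs <= 0) throw new Error("expiry must be in the future");
  if (remainingMs > maxMinutes * 60_000) throw new Error("expiry exceeds maximum duration");
  return expiry;
}
```

The same helper could back telemetry.createPolicy by passing only expiresAt.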

Compound workflow tool (admin)

  • support.createDebugPack(input)
    • internally calls:
      • telemetry queries
      • optional diagnostics session create + polling
    • returns a single structured artifact:
      • debugPackId, clusterRefs, sessionRefs, and a markdown summary
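
To make the returned artifact concrete, the final assembly step of support.createDebugPack might be sketched as below. The top-level fields (debugPackId, clusterRefs, sessionRefs, markdown summary) follow this plan; the inner fields of ClusterRef and SessionRef and the function name assembleDebugPack are assumptions, and the telemetry queries and session polling that would feed it are left out.

```typescript
// Inner fields are illustrative; the plan only names the top-level artifact fields.
interface ClusterRef {
  clusterId: string;
  eventCount: number;
}
interface SessionRef {
  sessionId: string;
  expiresAt: string;
}
interface DebugPack {
  debugPackId: string;
  clusterRefs: ClusterRef[];
  sessionRefs: SessionRef[];
  summary: string; // markdown
}

function assembleDebugPack(
  debugPackId: string,
  productId: string,
  clusters: ClusterRef[],
  sessions: SessionRef[],
): DebugPack {
  const clusterLines =
    clusters.length > 0
      ? clusters.map((c) => `- ${c.clusterId}: ${c.eventCount} events`)
      : ["- none found"];
  // The diagnostics session is optional, so an empty list is a valid outcome.
  const sessionLines =
    sessions.length > 0
      ? sessions.map((s) => `- ${s.sessionId} (expires ${s.expiresAt})`)
      : ["- none started"];
  const summary = [
    `# Debug Pack ${debugPackId}`,
    `Product: ${productId}`,
    "",
    "## Clusters",
    ...clusterLines,
    "",
    "## Diagnostics sessions",
    ...sessionLines,
  ].join("\n");
  return { debugPackId, clusterRefs: clusters, sessionRefs: sessions, summary };
}
```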

Output contracts (schemas)

Define explicit JSON schemas (Zod or equivalent) for:

  • TelemetryFilters
  • TelemetryPolicyInput (requires expiry)
  • DiagnosticsSessionTarget
  • DiagnosticsSessionConfig
  • DebugPackRequest
  • DebugPackResponse
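
As an illustration of what an explicit contract buys, here is a dependency-free sketch of validating TelemetryFilters at the tool boundary. In practice a Zod schema (or equivalent) would replace the hand-written checks. The field set (productId, from, to, eventTypes) is an assumption, since the real filter shape is not defined in this plan.

```typescript
// Assumed field set; the real TelemetryFilters shape is still to be defined.
interface TelemetryFilters {
  productId: string;
  from: string; // ISO 8601 start of the time window
  to: string; // ISO 8601 end of the time window
  eventTypes?: string[];
}

type ParseResult<T> = { ok: true; value: T } | { ok: false; errors: string[] };

function parseTelemetryFilters(input: unknown): ParseResult<TelemetryFilters> {
  if (typeof input !== "object" || input === null) {
    return { ok: false, errors: ["input must be an object"] };
  }
  const raw = input as Record<string, unknown>;
  const productId = raw["productId"];
  const from = raw["from"];
  const to = raw["to"];
  const eventTypes = raw["eventTypes"];
  const errors: string[] = [];

  if (typeof productId !== "string" || productId === "") errors.push("productId is required");
  const fromOk = typeof from === "string" && !Number.isNaN(Date.parse(from));
  const toOk = typeof to === "string" && !Number.isNaN(Date.parse(to));
  if (!fromOk) errors.push("from must be an ISO timestamp");
  if (!toOk) errors.push("to must be an ISO timestamp");
  if (fromOk && toOk && Date.parse(from as string) >= Date.parse(to as string)) {
    errors.push("from must be earlier than to");
  }
  const typesOk =
    eventTypes === undefined ||
    (Array.isArray(eventTypes) && eventTypes.every((e) => typeof e === "string"));
  if (!typesOk) errors.push("eventTypes must be an array of strings");

  if (errors.length > 0) return { ok: false, errors };
  const value: TelemetryFilters = {
    productId: productId as string,
    from: from as string,
    to: to as string,
  };
  if (eventTypes !== undefined) value.eventTypes = eventTypes as string[];
  return { ok: true, value };
}
```

Returning all errors at once, instead of failing on the first, gives an agent enough information to correct its call in one retry.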

Guardrails

  • Query limits:
    • default limit and max limit enforced at MCP layer.
  • Policy guardrails:
    • require expiresAt
    • require explicit eventTypes/modules
    • block wildcard collection unless super_admin
  • Diagnostics guardrails:
    • enforce max duration
    • enforce max capture volume per flush

Acceptance criteria

  • A single agent can:
    • pull clusters for a product + time window
    • propose a telemetry policy and preview targeting
    • create an expiring policy
    • start a diagnostics session and retrieve data
    • generate a “Debug Pack” artifact

Phase 1 engineering checklist (definition of done)

  • MCP layer enforces:
    • request-id propagation (x-request-id)
    • required product scoping (productId)
    • default query caps and maximum caps
    • expiry requirements for policies/sessions
  • Every mutating tool call produces an audit record, either:
    • by calling existing audit endpoints, or
    • by ensuring the underlying platform-service endpoint already records audit entries
  • Tool names and inputs are documented and stable (no breaking renames during Phase 1)
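
For tools whose underlying endpoint does not already record audit entries, the MCP layer could wrap the handler so that a record is written whether the call succeeds or fails. The sketch below is one way to do that; AuditRecord, AuditSink, and withAudit are hypothetical, the record fields are assumed, and the handler is synchronous only to keep the example short.

```typescript
// Assumed record shape; the real audit schema belongs to platform-service.
interface AuditRecord {
  tool: string;
  productId: string;
  actor: string;
  requestId: string;
  outcome: "success" | "failure";
  at: string;
}
type AuditSink = (record: AuditRecord) => void;

function withAudit<I extends { productId: string }, O>(
  tool: string,
  handler: (input: I) => O,
  sink: AuditSink,
  now: () => Date,
): (input: I, actor: string, requestId: string) => O {
  return (input, actor, requestId) => {
    let outcome: AuditRecord["outcome"] = "failure";
    try {
      const result = handler(input);
      outcome = "success";
      return result;
    } finally {
      // Runs on both paths, so a thrown error still leaves an audit trail.
      sink({
        tool,
        productId: input.productId,
        actor,
        requestId,
        outcome,
        at: now().toISOString(),
      });
    }
  };
}
```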

Phase 2 — A2A orchestration (1–2 weeks)

Goal

Turn multi-step support/ops workflows into repeatable agent playbooks with explicit handoffs.

Standard agents

  • DispatcherAgent
  • TelemetryAnalystAgent
  • DiagnosticsOrchestratorAgent
  • OpsExecutorAgent
  • ReportWriterAgent

Handoff artifacts

  • SupportIncidentBrief
  • TelemetryFindings
  • DiagnosticsSessionPlan
  • OpsChangePlan
  • FinalIncidentReport

Acceptance criteria

  • “Support Debug Pack” runs end-to-end via:
    • Dispatcher → Telemetry Analyst → Diagnostics Orchestrator → Report Writer
  • Every handoff is persisted (even if only in logs initially) with stable IDs.
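
A handoff envelope with a deterministic id, plus a check that a run followed the expected route, might be sketched as follows. The agent and artifact names come from this plan; the envelope fields, the id format, and the helper names makeHandoffId and validateRoute are assumptions, as is any pairing of a given artifact type with a given hop.

```typescript
type AgentName =
  | "DispatcherAgent"
  | "TelemetryAnalystAgent"
  | "DiagnosticsOrchestratorAgent"
  | "OpsExecutorAgent"
  | "ReportWriterAgent";

type ArtifactType =
  | "SupportIncidentBrief"
  | "TelemetryFindings"
  | "DiagnosticsSessionPlan"
  | "OpsChangePlan"
  | "FinalIncidentReport";

// Assumed envelope; only the agent and artifact names are taken from the plan.
interface Handoff {
  handoffId: string;
  incidentId: string;
  from: AgentName;
  to: AgentName;
  artifactType: ArtifactType;
  payload: unknown;
}

// Route for the "Support Debug Pack" workflow, as listed in the acceptance criteria.
const DEBUG_PACK_ROUTE: AgentName[] = [
  "DispatcherAgent",
  "TelemetryAnalystAgent",
  "DiagnosticsOrchestratorAgent",
  "ReportWriterAgent",
];

// Deriving the id from incident and step keeps it stable across retries.
function makeHandoffId(incidentId: string, step: number): string {
  return `${incidentId}:handoff:${step}`;
}

function validateRoute(handoffs: Handoff[], route: AgentName[]): string[] {
  const problems: string[] = [];
  if (handoffs.length !== route.length - 1) {
    problems.push(`expected ${route.length - 1} handoffs, got ${handoffs.length}`);
  }
  handoffs.forEach((h, i) => {
    if (h.from !== route[i] || h.to !== route[i + 1]) {
      problems.push(`handoff ${i} should go ${route[i]} -> ${route[i + 1]}`);
    }
  });
  return problems;
}
```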

Phase 3 — Extraction task iteration loop (1–3 weeks)

Goal

Make extraction prompt/task improvements safe, testable, and regression-resistant.

MCP tools

  • extraction.extract(text, taskId?, modelId?)
  • extraction.extractBatch(inputs)
  • extraction.submitJob(inputs, webhookUrl?)
  • extraction.getJob(jobId)
  • extraction.metrics() / extraction.cacheStats()

A2A workflow

  • TaskDesignerAgent drafts task prompt + examples
  • EvalRunnerAgent runs batch eval sets
  • RegressionAgent compares to baseline
  • PublisherAgent updates task registry + rollout

Acceptance criteria

  • A single command/workflow can:
    • run eval suite
    • compute simple quality metrics (schema validity, required fields coverage)
    • produce a report and recommended next edit
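
The two "simple quality metrics" named above can be computed without any model-specific logic. The sketch below assumes that each eval case yields either a parsed object or null when the output failed schema validation; EvalResult, computeMetrics, compareToBaseline, and the tolerance parameter are hypothetical, and coverage is measured over all cases so that invalid outputs count as missing every required field.

```typescript
interface EvalResult {
  caseId: string;
  output: Record<string, unknown> | null; // null means the output failed schema validation
}

interface EvalMetrics {
  schemaValidity: number; // share of cases with a schema-valid output
  requiredFieldCoverage: number; // share of required fields that are present and non-empty
}

function computeMetrics(results: EvalResult[], requiredFields: string[]): EvalMetrics {
  if (results.length === 0) return { schemaValidity: 0, requiredFieldCoverage: 0 };
  let validCount = 0;
  let presentCount = 0;
  for (const r of results) {
    if (r.output === null) continue;
    validCount += 1;
    for (const field of requiredFields) {
      const value = r.output[field];
      if (value !== undefined && value !== null && value !== "") presentCount += 1;
    }
  }
  const expected = results.length * requiredFields.length;
  return {
    schemaValidity: validCount / results.length,
    requiredFieldCoverage: expected === 0 ? 1 : presentCount / expected,
  };
}

// Lists the metrics that dropped by more than the tolerance relative to the baseline.
function compareToBaseline(current: EvalMetrics, baseline: EvalMetrics, tolerance: number): string[] {
  const regressions: string[] = [];
  if (current.schemaValidity < baseline.schemaValidity - tolerance) {
    regressions.push("schemaValidity");
  }
  if (current.requiredFieldCoverage < baseline.requiredFieldCoverage - tolerance) {
    regressions.push("requiredFieldCoverage");
  }
  return regressions;
}
```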

Phase 4 — Ops expansion (jobs/flags/maintenance/webhooks) (1–2 weeks)

Tools

  • jobs.list, jobs.trigger, jobs.listRuns
  • flags.list, flags.upsert, flags.evaluate
  • maintenance.get, maintenance.set
  • webhooks.listSubscriptions, webhooks.test, webhooks.rotateSecret

Acceptance criteria

  • “Ops Copilot” can safely execute a bounded change plan:
    • propose change
    • dry-run if supported
    • execute with audit
    • verify outcome
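
The four steps above can be sketched as a small runner that refuses to execute when a supported dry-run reports problems. ChangePlan, StepLog, and runChangePlan are hypothetical names, the callbacks stand in for real tool calls (for example flags.upsert followed by flags.evaluate), and they are synchronous only for brevity.

```typescript
interface ChangePlan<T> {
  description: string;
  dryRun?: () => string[]; // returns problems found; omitted when the tool has no dry-run
  execute: () => T;
  verify: (result: T) => boolean;
}

interface StepLog {
  step: "propose" | "dry-run" | "execute" | "verify";
  detail: string;
}

function runChangePlan<T>(
  plan: ChangePlan<T>,
  audit: (entry: StepLog) => void,
): { ok: boolean; log: StepLog[] } {
  const log: StepLog[] = [];
  const record = (entry: StepLog): void => {
    log.push(entry);
    audit(entry);
  };

  record({ step: "propose", detail: plan.description });

  if (plan.dryRun !== undefined) {
    const problems = plan.dryRun();
    record({ step: "dry-run", detail: problems.length === 0 ? "clean" : problems.join("; ") });
    // Stop before any mutation if the dry-run surfaced problems.
    if (problems.length > 0) return { ok: false, log };
  }

  const result = plan.execute();
  record({ step: "execute", detail: "applied" });

  const verified = plan.verify(result);
  record({ step: "verify", detail: verified ? "outcome confirmed" : "outcome mismatch" });
  return { ok: verified, log };
}
```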

Security & privacy checklist

  • Explicit productId on every tool call
  • Avoid returning raw PII in tool results
  • Ensure diagnostics redaction remains enforced server-side
  • Enforce expirations on policies and sessions
  • Rate limit MCP server endpoints
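
For the rate-limiting item, a fixed-window counter is one of the simplest options and is sketched below. It is an in-memory illustration only: FixedWindowLimiter is a hypothetical class, the key format (for example productId plus caller) and the limits are assumptions, and a multi-instance deployment would need shared state instead of a local Map.

```typescript
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly windows = new Map<string, { windowStart: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // nowMs is passed in so the limiter stays deterministic under test.
  allow(key: string, nowMs: number): boolean {
    let entry = this.windows.get(key);
    // Start a fresh window on first use or once the previous window has elapsed.
    if (entry === undefined || nowMs - entry.windowStart >= this.windowMs) {
      entry = { windowStart: nowMs, count: 0 };
      this.windows.set(key, entry);
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}
```

A fixed window allows short bursts at window boundaries; if that matters for diagnostics amplification, a token bucket or sliding window would be the next step.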

Rollout strategy

  • Start with internal-only usage (super_admin).
  • Add admin roles once guardrails are proven.
  • Add viewer read-only access for broader teams.
  • Add product-specific namespaces only after platform namespaces stabilize.

Work breakdown (suggested)

  • Milestone A: MVP MCP server + telemetry/diagnostics read-only
  • Milestone B: mutating tools + dry-run/expiry enforcement
  • Milestone C: support.createDebugPack compound tool
  • Milestone D: A2A runner + handoff schemas
  • Milestone E: extraction eval loop

Open questions (need decisions)

  • Should the MCP server call services via:
    • service REST endpoints only, or
    • direct Cosmos reads for some query paths?
  • Where should A2A handoff artifacts be stored:
    • telemetry events,
    • a dedicated Cosmos container,
    • or both?
  • Do we want a standalone MCP server repo/package, or should the server be colocated under platform-service?