saravanakumardb1 bf7769bdaa fix(diagnostics-client): use session-scoped ingest endpoints; update MCP+A2A docs

2026-03-05 10:41:02 -08:00

MCP + A2A — Implementation Plan (Execution-Ready)

Objective

Deliver a safe, auditable, product-aware MCP + A2A capability layer on top of existing ByteLyst services (primarily platform-service and extraction-service) so that agents can:

diagnose incidents (telemetry + diagnostics)
manage telemetry policies / rollouts
orchestrate remote diagnostics sessions
iterate on extraction tasks/prompts with eval loops
run repeatable ops workflows (jobs, flags, maintenance)

This plan intentionally starts with a minimal P0 slice and expands in phases.

Guiding constraints (must-haves)

Product isolation
- Every tool call is scoped to an explicit productId.
Auditability
- Every mutating tool call must produce an audit record (directly or via existing APIs).
Least privilege
- Query tools are available to viewer roles; mutating tools are gated to admin/super_admin.
Safety defaults
- Mutations support dryRun where feasible.
- Diagnostic amplification (policies/sessions) must require an expiresAt.
No new “shadow APIs”
- Prefer calling existing service endpoints; only add endpoints if the tool surface cannot be expressed otherwise.

Phase 0 — Baseline readiness (1–2 days)

Deliverables

A “tool surface” inventory mapped to existing REST endpoints and required headers.
A role matrix for tool authorization.
A request-id propagation and logging standard for MCP tool calls.

Phase 0 checklist (definition of done)

Confirm whether MCP tool implementations will call REST only (preferred) or allow direct Cosmos reads for selected query paths.
Choose whether MCP is:
- a new service/package in learning_ai_common_plat, or
- colocated under services/platform-service.
Define the initial auth strategy:
- JWT only, or
- JWT + API tokens for automation.
Define tool-level authorization rules (viewer/admin/super_admin) and how they’re enforced.
Agree on a consistent product scoping rule:
- productId is mandatory input for every tool call and sent as x-product-id downstream.
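
One way the role matrix and the product scoping rule could be enforced at the MCP layer is a deny-by-default check that runs before any tool handler. This is a minimal sketch, not the agreed design: the role names and tool names come from this plan, while `ROLE_RANK`, `TOOL_MIN_ROLE`, and `authorizeToolCall` are hypothetical names, and the tool-to-role table lists only a few tools for illustration.

```typescript
// Role names follow the plan; the numeric ranking is an assumption.
type PlatformRole = "viewer" | "admin" | "super_admin";

const ROLE_RANK: Record<PlatformRole, number> = {
  viewer: 0,
  admin: 1,
  super_admin: 2,
};

// Hypothetical minimum-role table; only a few tools from the plan are listed.
const TOOL_MIN_ROLE: Record<string, PlatformRole> = {
  "telemetry.queryEvents": "viewer",
  "telemetry.listPolicies": "viewer",
  "diagnostics.getSession": "viewer",
  "telemetry.createPolicy": "admin",
  "diagnostics.createSession": "admin",
};

interface AuthzDecision {
  allowed: boolean;
  reason: string;
}

// Deny by default: a missing productId or an unknown tool is rejected
// before the role comparison is even attempted.
function authorizeToolCall(
  toolName: string,
  callerRole: PlatformRole,
  productId: string | undefined
): AuthzDecision {
  if (!productId || productId.trim() === "") {
    return { allowed: false, reason: "productId is required" };
  }
  if (!Object.prototype.hasOwnProperty.call(TOOL_MIN_ROLE, toolName)) {
    return { allowed: false, reason: "unknown tool: " + toolName };
  }
  const required = TOOL_MIN_ROLE[toolName] as PlatformRole;
  if (ROLE_RANK[callerRole] < ROLE_RANK[required]) {
    return { allowed: false, reason: "requires role " + required };
  }
  return { allowed: true, reason: "ok" };
}
```

Keeping the table as data rather than scattered `if` statements means the role matrix deliverable and the enforcement code can be reviewed side by side.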

Decisions to lock

MCP server shape
- Single server with namespaces (platform.*, extraction.*) vs. two servers.
Auth mechanism
- Primary: platform-service JWT.
- Secondary: platform API tokens for trusted automation.

Required invariants

x-request-id propagated on all downstream calls.
x-product-id required and validated for all calls.
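
The two invariants above could be centralized in a single header builder that every downstream call goes through. The sketch below assumes a hypothetical `ToolCallContext` shape and an injected id generator (so it does not depend on any particular runtime's UUID API); only the header names `x-request-id` and `x-product-id` come from this plan.

```typescript
// Hypothetical per-call context carried through the MCP layer.
interface ToolCallContext {
  productId: string;
  requestId?: string; // inbound id, if the caller supplied one
}

// Builds the headers for a downstream REST call.
// The id generator is injected so the sketch stays runtime-neutral.
function buildDownstreamHeaders(
  ctx: ToolCallContext,
  generateRequestId: () => string
): Record<string, string> {
  const productId = ctx.productId.trim();
  if (productId === "") {
    throw new Error("x-product-id is required for every downstream call");
  }
  // Reuse the inbound id so one tool call maps to one trace across services.
  const requestId =
    ctx.requestId && ctx.requestId.trim() !== ""
      ? ctx.requestId
      : generateRequestId();
  return {
    "x-request-id": requestId,
    "x-product-id": productId,
  };
}
```

Failing fast on an empty product id keeps an unscoped request from ever reaching platform-service or extraction-service.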

Phase 1 — MVP MCP server (P0 slice) (3–7 days)

Goal

Enable a Support/Ops agent to answer: “What’s happening?” and “Start a bounded diagnostics session.”

Tool surface (MVP)

Read-only (viewer)

telemetry.queryEvents(filters)
telemetry.listClusters(filters)
telemetry.listPolicies()
telemetry.getMetrics()
diagnostics.getSession(sessionId)
diagnostics.listSessions(filters)
diagnostics.getLogs(sessionId, filters)
diagnostics.getTraces(sessionId, filters)
extraction.sidecarHealth()

Phase 1 prerequisites (to avoid hidden integration failures)

Align @bytelyst/diagnostics-client ingest endpoint with platform-service.
- Today the service ingests via POST /api/diagnostics/sessions/:id/logs and POST /api/diagnostics/sessions/:id/traces (plus a session-scoped screenshot upload), while the client sends everything to POST /api/diagnostics/ingest.
- Decision: update the client to use the session-scoped endpoints (no backwards-compat alias endpoint).
Decide whether @bytelyst/telemetry-client should become policy-aware by consuming GET /api/telemetry/config.
- If yes, treat it as a Phase 1 deliverable (with caching + ETag). Otherwise, explicitly defer to Phase 2.
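
If the policy-aware client is taken on, the "caching + ETag" part could follow the standard HTTP conditional-request pattern: send `If-None-Match` with the cached ETag and keep the cached config on a 304. The sketch below covers only that cache logic as pure functions; the shapes `TelemetryConfigCache` and `ConfigHttpResult` are assumptions, and the actual response format of GET /api/telemetry/config is not specified in this plan.

```typescript
// Hypothetical cache entry; the config payload shape is left opaque.
interface TelemetryConfigCache {
  etag: string;
  config: unknown;
}

// Hypothetical summary of an HTTP response to GET /api/telemetry/config.
interface ConfigHttpResult {
  statusCode: number; // 200 or 304 expected
  etag?: string;      // value of the ETag response header, if any
  body?: unknown;     // parsed JSON body on 200
}

// Headers for a conditional GET: send If-None-Match when a usable ETag is cached.
function conditionalConfigHeaders(
  cache: TelemetryConfigCache | null
): Record<string, string> {
  const headers: Record<string, string> = {};
  if (cache && cache.etag !== "") {
    headers["If-None-Match"] = cache.etag;
  }
  return headers;
}

// Folds a response into the cache. 304 keeps the cached copy; 200 replaces it.
function applyConfigResult(
  cache: TelemetryConfigCache | null,
  result: ConfigHttpResult
): TelemetryConfigCache | null {
  if (result.statusCode === 200) {
    // Without an ETag the next request is simply unconditional.
    return { etag: result.etag !== undefined ? result.etag : "", config: result.body };
  }
  // 304 Not Modified, or an error: keep the last known config.
  return cache;
}
```

Keeping the last known config on errors is a design choice in this sketch; the plan does not say whether the client should fail open or fail closed when the config endpoint is unavailable.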

Mutating (admin)

telemetry.previewPolicy(targeting)
telemetry.createPolicy(input)
- requires expiresAt
telemetry.updatePolicy(id, updates)
telemetry.updateClusterStatus(clusterId, pk, status)
diagnostics.createSession(target, config)
- requires expiresAt or expiresInMinutes
diagnostics.updateSession(sessionId, updates)
diagnostics.cancelSession(sessionId)
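
Because diagnostics.createSession accepts either expiresAt or expiresInMinutes, the MCP layer will likely want to normalize both into one absolute timestamp before calling downstream. The sketch below is one possible normalization. It assumes that exactly one of the two fields may be supplied and that the expiry must lie in the future; neither rule is stated in this plan, and `ExpiryInput` and `resolveExpiresAt` are hypothetical names.

```typescript
// Hypothetical input fragment shared by policy and session creation.
interface ExpiryInput {
  expiresAt?: string;        // absolute ISO 8601 timestamp
  expiresInMinutes?: number; // relative alternative
}

// Returns an absolute ISO timestamp or throws. `now` is injected for testing.
function resolveExpiresAt(input: ExpiryInput, now: Date): string {
  const hasAbsolute = input.expiresAt !== undefined;
  const hasRelative = input.expiresInMinutes !== undefined;
  // Assumption: supplying both (or neither) is treated as an error.
  if (hasAbsolute === hasRelative) {
    throw new Error("provide exactly one of expiresAt or expiresInMinutes");
  }

  let expiryMs: number;
  if (input.expiresAt !== undefined) {
    expiryMs = Date.parse(input.expiresAt);
    if (isNaN(expiryMs)) {
      throw new Error("expiresAt is not a valid timestamp");
    }
  } else {
    const minutes = input.expiresInMinutes as number;
    if (!isFinite(minutes) || minutes <= 0) {
      throw new Error("expiresInMinutes must be a positive number");
    }
    expiryMs = now.getTime() + minutes * 60000;
  }

  if (expiryMs <= now.getTime()) {
    throw new Error("expiry must be in the future");
  }
  return new Date(expiryMs).toISOString();
}
```

For example, with `now` set to 2026-03-05T00:00:00.000Z, an input of `{ expiresInMinutes: 30 }` resolves to `2026-03-05T00:30:00.000Z`.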

Compound workflow tool (admin)

support.createDebugPack(input)
- internally calls:
  - telemetry queries
  - optional diagnostics session create + polling
- returns a single structured artifact:
  - debugPackId, clusterRefs, sessionRefs, and a markdown summary

Output contracts (schemas)

Define explicit JSON schemas (Zod or equivalent) for:

TelemetryFilters
TelemetryPolicyInput (requires expiry)
DiagnosticsSessionTarget
DiagnosticsSessionConfig
DebugPackRequest
DebugPackResponse

Guardrails

Query limits:
- default limit and max limit enforced at MCP layer.
Policy guardrails:
- require expiresAt
- require explicit eventTypes/modules
- block wildcard collection unless super_admin
Diagnostics guardrails:
- enforce max duration
- enforce max capture volume per flush
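
The guardrails above could be expressed as small pure checks that run before any downstream call. This is a sketch under stated assumptions: the numeric caps are placeholders (the plan requires a default and a maximum but does not fix values), `"*"` is assumed to be the wildcard token, and "explicit eventTypes/modules" is read as "at least one of the two lists is non-empty". All function and constant names are hypothetical.

```typescript
// Placeholder caps; the real values are still to be decided.
const DEFAULT_QUERY_LIMIT = 50;
const MAX_QUERY_LIMIT = 500;
const MAX_SESSION_MINUTES = 60;

// Applies the default when no limit is given and clamps to [1, MAX_QUERY_LIMIT].
function clampQueryLimit(requested: number | undefined): number {
  if (requested === undefined || isNaN(requested)) {
    return DEFAULT_QUERY_LIMIT;
  }
  const whole = Math.floor(requested);
  if (whole < 1) {
    return 1;
  }
  return Math.min(whole, MAX_QUERY_LIMIT);
}

interface PolicyGuardInput {
  expiresAt?: string;
  eventTypes: string[];
  modules: string[];
}

// Returns a list of violations; an empty list means the policy may proceed.
function checkPolicyGuardrails(
  input: PolicyGuardInput,
  callerRole: string
): string[] {
  const violations: string[] = [];
  if (!input.expiresAt) {
    violations.push("expiresAt is required");
  }
  if (input.eventTypes.length === 0 && input.modules.length === 0) {
    violations.push("explicit eventTypes or modules are required");
  }
  const usesWildcard = input.eventTypes
    .concat(input.modules)
    .some((value) => value === "*");
  if (usesWildcard && callerRole !== "super_admin") {
    violations.push("wildcard collection requires super_admin");
  }
  return violations;
}

// Diagnostics sessions must be positive in length and within the cap.
function isSessionDurationAllowed(durationMinutes: number): boolean {
  return durationMinutes > 0 && durationMinutes <= MAX_SESSION_MINUTES;
}
```

Returning every violation at once, rather than throwing on the first, lets an agent correct a proposed policy in a single round trip.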

Acceptance criteria

A single agent can:
- pull clusters for a product + time window
- propose a telemetry policy and preview targeting
- create an expiring policy
- start a diagnostics session and retrieve data
- generate a “Debug Pack” artifact

Phase 1 engineering checklist (definition of done)

MCP layer enforces:
- request-id propagation (x-request-id)
- required product scoping (productId)
- default query caps and maximum caps
- expiry requirements for policies/sessions
Every mutating tool call produces an audit record, either:
- by calling existing audit endpoints, or
- by ensuring the underlying platform-service endpoint already records an audit entry
Tool names and inputs are documented and stable (no breaking renames during Phase 1)
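
For tools whose underlying endpoint does not already record an audit entry, the MCP layer could wrap the handler so that exactly one record is written per call, on success or failure. The sketch below is synchronous for brevity (real handlers calling platform-service would presumably be asynchronous), and the record fields, `runWithAudit`, and the `writeAudit` sink are all hypothetical; the plan does not define the audit record format.

```typescript
// Hypothetical audit record; field names are illustrative only.
interface ToolAuditRecord {
  requestId: string;
  productId: string;
  toolName: string;
  actor: string;
  outcome: "success" | "failure";
  error?: string;
}

interface MutatingCallInfo {
  requestId: string;
  productId: string;
  toolName: string;
  actor: string;
}

function toAuditRecord(
  call: MutatingCallInfo,
  outcome: "success" | "failure",
  error?: string
): ToolAuditRecord {
  const record: ToolAuditRecord = {
    requestId: call.requestId,
    productId: call.productId,
    toolName: call.toolName,
    actor: call.actor,
    outcome: outcome,
  };
  if (error !== undefined) {
    record.error = error;
  }
  return record;
}

// Runs a mutating handler and emits one audit record for the attempt.
// Failures are audited and then rethrown so the caller still sees the error.
function runWithAudit<T>(
  call: MutatingCallInfo,
  handler: () => T,
  writeAudit: (record: ToolAuditRecord) => void
): T {
  let result: T;
  try {
    result = handler();
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    writeAudit(toAuditRecord(call, "failure", message));
    throw err;
  }
  writeAudit(toAuditRecord(call, "success"));
  return result;
}
```

Auditing failed attempts as well as successful ones is a choice made in this sketch; the checklist only requires that every mutating call produces a record.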

Phase 2 — A2A orchestration (1–2 weeks)

Goal

Turn multi-step support/ops workflows into repeatable agent playbooks with explicit handoffs.

Standard agents

DispatcherAgent
TelemetryAnalystAgent
DiagnosticsOrchestratorAgent
OpsExecutorAgent
ReportWriterAgent

Handoff artifacts

SupportIncidentBrief
TelemetryFindings
DiagnosticsSessionPlan
OpsChangePlan
FinalIncidentReport

Acceptance criteria

“Support Debug Pack” runs end-to-end via:
- Dispatcher → Telemetry Analyst → Diagnostics Orchestrator → Report Writer
Every handoff is persisted (even if only in logs initially) with stable IDs.
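
A minimal way to satisfy "every handoff is persisted with stable IDs" during early development is an append-only log keyed by run id and sequence number. The sketch below keeps entries in memory only; the artifact kinds come from this plan, while `HandoffEnvelope`, `HandoffLog`, and the `runId:sequence` id scheme are assumptions, and the real storage target is still an open question.

```typescript
// Artifact kinds listed in this plan.
type HandoffKind =
  | "SupportIncidentBrief"
  | "TelemetryFindings"
  | "DiagnosticsSessionPlan"
  | "OpsChangePlan"
  | "FinalIncidentReport";

// Hypothetical envelope wrapped around every artifact passed between agents.
interface HandoffEnvelope {
  handoffId: string; // stable: derived from run id + sequence
  runId: string;
  sequence: number;
  fromAgent: string;
  toAgent: string;
  kind: HandoffKind;
  payload: unknown;
}

// Append-only, in-memory log for one workflow run.
class HandoffLog {
  private readonly runId: string;
  private readonly entries: HandoffEnvelope[] = [];

  constructor(runId: string) {
    this.runId = runId;
  }

  record(
    fromAgent: string,
    toAgent: string,
    kind: HandoffKind,
    payload: unknown
  ): HandoffEnvelope {
    const sequence = this.entries.length + 1;
    const envelope: HandoffEnvelope = {
      handoffId: this.runId + ":" + sequence,
      runId: this.runId,
      sequence: sequence,
      fromAgent: fromAgent,
      toAgent: toAgent,
      kind: kind,
      payload: payload,
    };
    this.entries.push(envelope);
    return envelope;
  }

  // Returns a copy so callers cannot rewrite history.
  list(): HandoffEnvelope[] {
    return this.entries.slice();
  }
}
```

For the Support Debug Pack flow, the log would hold three entries: Dispatcher to Telemetry Analyst, Telemetry Analyst to Diagnostics Orchestrator, and Diagnostics Orchestrator to Report Writer.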

Phase 3 — Extraction task iteration loop (1–3 weeks)

Goal

Make extraction prompt/task improvements safe, testable, and regression-resistant.

MCP tools

extraction.extract(text, taskId?, modelId?)
extraction.extractBatch(inputs)
extraction.submitJob(inputs, webhookUrl?)
extraction.getJob(jobId)
extraction.metrics() / extraction.cacheStats()

A2A workflow

TaskDesignerAgent drafts task prompt + examples
EvalRunnerAgent runs batch eval sets
RegressionAgent compares to baseline
PublisherAgent updates task registry + rollout

Acceptance criteria

A single command/workflow can:
- run eval suite
- compute simple quality metrics (schema validity, required fields coverage)
- produce a report and recommended next edit
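
The two "simple quality metrics" named above could be computed as follows. This sketch uses working definitions that are assumptions, not agreed ones: an output counts as schema-valid when it parsed into an object and every required field is populated, and required-field coverage is the share of all (output, required field) slots that are populated. `ExtractionOutput`, `EvalQualityMetrics`, and `computeEvalMetrics` are hypothetical names.

```typescript
// null stands for an output that failed to parse at all.
type ExtractionOutput = Record<string, unknown> | null;

interface EvalQualityMetrics {
  total: number;
  schemaValidRate: number;       // outputs with every required field populated
  requiredFieldCoverage: number; // populated required-field slots / all slots
}

// A field counts as populated unless it is missing, null, or an empty string.
function isPopulated(value: unknown): boolean {
  return value !== undefined && value !== null && value !== "";
}

function computeEvalMetrics(
  outputs: ExtractionOutput[],
  requiredFields: string[]
): EvalQualityMetrics {
  let validCount = 0;
  let populatedSlots = 0;
  for (const output of outputs) {
    if (output === null) {
      continue; // unparseable output counts against both metrics
    }
    let present = 0;
    for (const field of requiredFields) {
      if (isPopulated(output[field])) {
        present += 1;
      }
    }
    populatedSlots += present;
    if (present === requiredFields.length) {
      validCount += 1;
    }
  }
  const total = outputs.length;
  const slots = total * requiredFields.length;
  return {
    total: total,
    schemaValidRate: total === 0 ? 0 : validCount / total,
    requiredFieldCoverage: slots === 0 ? 0 : populatedSlots / slots,
  };
}
```

Running the same function over a baseline eval set and a candidate set gives the RegressionAgent two directly comparable numbers per metric.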

Phase 4 — Ops expansion (jobs/flags/maintenance/webhooks) (1–2 weeks)

Tools

jobs.list, jobs.trigger, jobs.listRuns
flags.list, flags.upsert, flags.evaluate
maintenance.get, maintenance.set
webhooks.listSubscriptions, webhooks.test, webhooks.rotateSecret

Acceptance criteria

“Ops Copilot” can safely execute a bounded change plan:
- propose change
- dry-run if supported
- execute with audit
- verify outcome
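
The four-step sequence above can be sketched as a small runner that stops before touching real state if the dry run fails. This is synchronous for brevity (real calls would presumably be asynchronous), and every name here (`OpsChange`, `OpsChangeDeps`, `runBoundedChange`) is hypothetical. The `execute` dependency is assumed to record its own audit entry, as required for mutating tools.

```typescript
interface OpsChange {
  description: string;
  supportsDryRun: boolean;
}

// Hypothetical dependencies, injected so the runner holds no service logic.
interface OpsChangeDeps {
  dryRun: (change: OpsChange) => boolean; // true if the dry run passes
  execute: (change: OpsChange) => void;   // assumed to write an audit record
  verify: (change: OpsChange) => boolean; // true if the outcome is as expected
}

type OpsStepName = "propose" | "dry-run" | "execute" | "verify";

interface OpsChangeResult {
  completedSteps: OpsStepName[];
  succeeded: boolean;
}

function runBoundedChange(
  change: OpsChange,
  deps: OpsChangeDeps
): OpsChangeResult {
  // The proposal itself is the input, so that step is complete on entry.
  const completedSteps: OpsStepName[] = ["propose"];

  if (change.supportsDryRun) {
    if (!deps.dryRun(change)) {
      // Stop here: nothing has been mutated yet.
      return { completedSteps: completedSteps, succeeded: false };
    }
    completedSteps.push("dry-run");
  }

  deps.execute(change);
  completedSteps.push("execute");

  const verified = deps.verify(change);
  if (verified) {
    completedSteps.push("verify");
  }
  return { completedSteps: completedSteps, succeeded: verified };
}
```

A result whose steps end at "execute" with `succeeded: false` signals that the change was applied but could not be verified, which is the case an operator most needs to see.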

Security & privacy checklist

Explicit productId on every tool call
Avoid returning raw PII in tool results
Ensure diagnostics redaction remains enforced server-side
Enforce expirations on policies and sessions
Rate limit MCP server endpoints

Rollout strategy

Start with internal-only usage (super_admin).
Add admin roles once guardrails are proven.
Add viewer read-only access for broader teams.
Add product-specific namespaces only after platform namespaces stabilize.

Work breakdown (suggested)

Milestone A: MVP MCP server + telemetry/diagnostics read-only
Milestone B: mutating tools + dry-run/expiry enforcement
Milestone C: support.createDebugPack compound tool
Milestone D: A2A runner + handoff schemas
Milestone E: extraction eval loop

Open questions (need decisions)

Should the MCP server call services via:
- service REST endpoints only, or
- direct Cosmos reads for some query paths?
Where should A2A handoff artifacts be stored:
- telemetry events,
- a dedicated Cosmos container,
- or both?
Do we want a standalone MCP server repo/package, or should the server be colocated under platform-service?
