MCP Server Framework — Recommended Architecture (ByteLyst)

Why an MCP server here

This workspace already has a clear separation of concerns:

  • Authoritative services (Fastify): platform-service, extraction-service, plus product backends.
  • Dashboards (Next.js): admin + tracker.
  • Client SDKs: Swift/Kotlin platform SDKs + TS client packages.

An MCP server becomes the single programmatic gateway that agents can call to:

  • query/act on platform state
  • assemble debugging evidence
  • run repeatable ops workflows
  • safely orchestrate A2A agents

Core design constraints

  • Do not bypass service invariants
    • Prefer calling service endpoints, or repositories that apply the same validation (Zod) and auth checks.
  • Auditability
    • Every mutating tool should emit audit logs (or call APIs that already do).
  • Least privilege
    • Split tools by role (viewer/admin/super_admin).
  • Product isolation
    • All tools/resources must explicitly bind to productId.
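
One way these constraints might be enforced is at a single tool-invocation boundary. The sketch below is illustrative only: ToolDef, ToolContext, invokeTool, and the in-memory auditLog are hypothetical names, and a real handler would call platform-service (reusing its Zod schemas and audit pipeline) rather than return a canned result.

```typescript
// Hypothetical shapes; none of these names come from the ByteLyst codebase.
type ToolContext = { productId: string; actor: string };
type AuditEntry = { tool: string; productId: string; actor: string; at: string };

interface ToolDef<I, O> {
  name: string;
  mutating: boolean;
  run: (ctx: ToolContext, input: I) => O;
}

// In-memory sink standing in for the platform's real audit log (assumption).
const auditLog: AuditEntry[] = [];

function invokeTool<I, O>(tool: ToolDef<I, O>, ctx: Partial<ToolContext>, input: I): O {
  // Product isolation: refuse to run without an explicit productId and actor.
  if (!ctx.productId || !ctx.actor) {
    throw new Error(`${tool.name}: productId and actor are required`);
  }
  const bound: ToolContext = { productId: ctx.productId, actor: ctx.actor };
  const result = tool.run(bound, input);
  // Auditability: record every mutating call once it has succeeded.
  if (tool.mutating) {
    auditLog.push({
      tool: tool.name,
      productId: bound.productId,
      actor: bound.actor,
      at: new Date().toISOString(),
    });
  }
  return result;
}

// Example mutating tool; the handler body is a stub, not a real service call.
const setFlag: ToolDef<{ key: string; value: boolean }, { ok: boolean }> = {
  name: "flags.set",
  mutating: true,
  run: (_ctx, _input) => ({ ok: true }),
};
```

Routing every tool through one function like this keeps the productId check and the audit write in a single place instead of repeating them per tool.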

Reality check: what exists today

  • platform-service already exposes:
    • GET /api/telemetry/config (ETag-based client collection config)
    • GET /api/telemetry/query, GET /api/telemetry/clusters, policies CRUD (admin)
    • diagnostics session CRUD + GET /api/diagnostics/sessions/:id/logs|traces|screenshots (admin)
  • extraction-service already exposes:
    • /extract, /extract/batch, /extract/jobs, sidecar health, metrics, cache stats

The primary “new work” for MCP is orchestration, safety gating, and consistent auth/audit, not inventing new primitives.

Proposed MCP servers (2-tier)

1) bytelyst-platform-mcp (primary)

Backed by platform-service (port 4003) and optionally Cosmos for direct reads.

  • Responsibilities
    • Telemetry querying + policy management
    • Remote diagnostics sessions orchestration
    • Jobs trigger/list
    • Flags/settings/maintenance
    • Webhooks + delivery logs
    • Audit query

2) bytelyst-extraction-mcp (specialized)

Backed by extraction-service (port 4005).

  • Responsibilities
    • Extract / batch extract
    • Submit and monitor async extraction jobs
    • Sidecar health + circuit breaker insight
    • Metrics + cache stats

(Optionally, these can be a single MCP server with two namespaces.)
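
If the single-server option is chosen, routing by namespace could look roughly like the sketch below. The ports come from this document; the localhost host, the BACKENDS map, and backendFor are assumptions made for illustration.

```typescript
// Ports are from this document; "localhost" and these names are assumptions.
const BACKENDS = {
  platform: "http://localhost:4003",
  extraction: "http://localhost:4005",
} as const;

// Resolve the backing service from the namespace prefix of a tool name,
// e.g. "extraction.extract" -> extraction-service, everything else -> platform-service.
function backendFor(toolName: string): string {
  const dot = toolName.indexOf(".");
  if (dot <= 0) {
    throw new Error(`tool name must be namespaced: ${toolName}`);
  }
  const namespace = toolName.slice(0, dot);
  return namespace === "extraction" ? BACKENDS.extraction : BACKENDS.platform;
}
```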

Tool taxonomy

A) Read-only tools

  • telemetry.queryEvents
  • telemetry.listClusters
  • telemetry.getMetrics
  • diagnostics.getSession
  • diagnostics.getLogs/getTraces
  • jobs.list/listRuns
  • flags.list
  • settings.get
  • webhooks.listSubscriptions/listDeliveries
  • extraction.metrics/cacheStats/sidecarHealth

B) Mutating tools (require elevated role)

  • telemetry.createPolicy/updatePolicy/deletePolicy
  • telemetry.updateClusterStatus
  • diagnostics.createSession/updateSession/cancelSession
  • jobs.trigger
  • maintenance.set
  • flags.set (or flag upserts)
  • webhooks.rotateSecret / webhooks.test
  • extraction.rateLimitReset (if you keep that admin endpoint)

C) Compound tools (“one tool = one workflow”)

  • support.createDebugPack(reportInput)
    • pulls telemetry timeline + cluster context
    • optionally starts diagnostics session
    • returns a single structured artifact (markdown/json)

This reduces prompt fragility compared with requiring the LLM to call 8 tools in the right order.
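
A compound tool of this kind might be shaped as follows. This is a sketch under stated assumptions: ReportInput, DebugPackDeps, and DebugPack are hypothetical types, the dependencies are injected and synchronous to keep the example self-contained, and a real implementation would make async HTTP calls to platform-service.

```typescript
// All types and function names below are illustrative assumptions.
interface ReportInput {
  productId: string;
  userId: string;
  startDiagnostics?: boolean;
}

interface DebugPackDeps {
  queryEvents: (productId: string, userId: string) => string[];
  listClusters: (productId: string) => string[];
  createSession: (productId: string, userId: string) => string; // returns a session id
}

interface DebugPack {
  productId: string;
  timeline: string[];
  clusters: string[];
  sessionId: string | null;
  markdown: string;
}

function createDebugPack(input: ReportInput, deps: DebugPackDeps): DebugPack {
  // The order is fixed in code, so the LLM never has to sequence these calls.
  const timeline = deps.queryEvents(input.productId, input.userId);
  const clusters = deps.listClusters(input.productId);
  const sessionId = input.startDiagnostics
    ? deps.createSession(input.productId, input.userId)
    : null;

  // Single structured artifact: raw fields plus a markdown summary.
  const markdown = [
    `# Debug pack for ${input.userId}`,
    `Events: ${timeline.length}`,
    `Clusters: ${clusters.join(", ") || "none"}`,
    `Diagnostics session: ${sessionId ?? "not started"}`,
  ].join("\n");

  return { productId: input.productId, timeline, clusters, sessionId, markdown };
}
```

Injecting the three dependencies also makes the workflow easy to test with stubs before it is wired to real endpoints.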

MCP resources

Resources should be stable references agents can read repeatedly:

  • platform.modules.index
    • module list + base routes + required headers
  • telemetry.schema
  • diagnostics.schema
  • extraction.tasks.catalog
  • ops.runbooks
    • e.g. “how to debug iOS keyboard insert_noop”
  • product.identity
    • productId, plan tiers, allowed baseUrls

Prompts (MCP prompt templates)

  • prompt.support_triage
  • prompt.telemetry_policy_proposal
  • prompt.remote_diagnostics_session_plan
  • prompt.extraction_task_design

Authentication & authorization

  • Primary: platform-service JWT (same verifyToken logic).
  • Secondary: service-to-service API tokens (only for trusted automation).
  • Tool gating
    • viewer: query-only
    • admin: policy updates, create diagnostics sessions
    • super_admin: secret rotation, maintenance, destructive operations
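
The gating above could be expressed as a minimum-role lookup. In this sketch the tool names come from the taxonomy in this document, but the MIN_ROLE table, the rank ordering, and canInvoke are assumptions; the real check would sit behind the same verifyToken logic that platform-service uses.

```typescript
type Role = "viewer" | "admin" | "super_admin";

// A higher rank may do everything a lower rank may do (assumed ordering).
const ROLE_RANK: Record<Role, number> = { viewer: 0, admin: 1, super_admin: 2 };

// Hypothetical mapping of tool name to the minimum role allowed to call it.
const MIN_ROLE = new Map<string, Role>([
  ["telemetry.queryEvents", "viewer"],
  ["telemetry.createPolicy", "admin"],
  ["diagnostics.createSession", "admin"],
  ["webhooks.rotateSecret", "super_admin"],
  ["maintenance.set", "super_admin"],
]);

function canInvoke(role: Role, tool: string): boolean {
  const required = MIN_ROLE.get(tool);
  // Unknown tools are denied rather than defaulting to viewer access.
  if (required === undefined) return false;
  return ROLE_RANK[role] >= ROLE_RANK[required];
}
```

Denying unregistered tools by default means a newly added mutating tool cannot become callable by viewers through omission.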

Observability for the MCP server

  • Use structured logs (Fastify/pino style) and propagate x-request-id.
  • Record tool invocation metrics in telemetry on the backend_service channel:
    • module: mcp
    • eventName: tool_invoked, tool_failed, a2a_handoff
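
A wrapper along these lines could emit one event per tool call. The channel, module, and eventName values are taken from the list above; the remaining fields (requestId, tool, durationMs), the instrument function, and the emit callback are assumptions, and the real event schema belongs to platform-service telemetry.

```typescript
type McpEventName = "tool_invoked" | "tool_failed" | "a2a_handoff";

// Field names beyond channel/module/eventName are assumptions for illustration.
interface McpTelemetryEvent {
  channel: "backend_service";
  module: "mcp";
  eventName: McpEventName;
  requestId: string; // value propagated from the x-request-id header
  tool: string;
  durationMs: number;
}

function instrument<T>(
  tool: string,
  requestId: string,
  emit: (event: McpTelemetryEvent) => void,
  fn: () => T,
): T {
  const started = Date.now();
  let eventName: McpEventName = "tool_invoked";
  try {
    return fn();
  } catch (err) {
    eventName = "tool_failed";
    throw err;
  } finally {
    // Exactly one event per call, on success or failure.
    emit({
      channel: "backend_service",
      module: "mcp",
      eventName,
      requestId,
      tool,
      durationMs: Date.now() - started,
    });
  }
}
```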

Safe defaults / guardrails

  • Any mutating tool should support a dryRun: true mode.
  • Enforce expiresAt on anything that amplifies diagnostic collection (telemetry policies, diagnostics sessions).
  • Cap queries by default (limit/pageSize) and require callers to request higher limits explicitly.
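
These three guardrails can be sketched as small helpers. The constants (a default limit of 50, a maximum of 500, a 24-hour maximum TTL) and the function names are assumptions chosen for illustration, not values from the platform.

```typescript
// All three constants are assumed values, not platform settings.
const DEFAULT_LIMIT = 50;
const MAX_LIMIT = 500;
const MAX_TTL_MS = 24 * 60 * 60 * 1000;

// Apply the default when no limit is given; clamp explicit requests to the maximum.
function resolveLimit(requested?: number): number {
  if (requested === undefined) return DEFAULT_LIMIT;
  if (!Number.isInteger(requested) || requested < 1) {
    throw new Error("limit must be a positive integer");
  }
  return Math.min(requested, MAX_LIMIT);
}

// Require a future expiresAt that does not exceed the maximum TTL.
function assertExpiry(expiresAt: string | undefined, now: Date = new Date()): Date {
  if (!expiresAt) throw new Error("expiresAt is required");
  const parsed = new Date(expiresAt);
  if (Number.isNaN(parsed.getTime())) {
    throw new Error("expiresAt must be a valid timestamp");
  }
  const ttl = parsed.getTime() - now.getTime();
  if (ttl <= 0 || ttl > MAX_TTL_MS) {
    throw new Error("expiresAt must be in the future and within the maximum TTL");
  }
  return parsed;
}

interface MutationResult<T> {
  dryRun: boolean;
  plan: T;
  applied: boolean;
}

// With dryRun set, return the plan without calling apply.
function runMutation<T>(
  plan: T,
  opts: { dryRun?: boolean },
  apply: (plan: T) => void,
): MutationResult<T> {
  if (opts.dryRun) return { dryRun: true, plan, applied: false };
  apply(plan);
  return { dryRun: false, plan, applied: true };
}
```

Returning the plan in both modes lets an agent show the operator exactly what a mutating tool would change before it is run for real.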

Known integration risk (fix early)

  • @bytelyst/diagnostics-client currently flushes to POST /api/diagnostics/ingest, while platform-service ingests via session-scoped endpoints.
  • Resolve this mismatch before using diagnostics tooling as a core MCP/A2A workflow dependency.
    • Decision: update @bytelyst/diagnostics-client to flush to POST /api/diagnostics/sessions/:id/logs|traces.
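
On the client side, the change might amount to splitting a buffered flush into one session-scoped request per kind. The sketch below is hypothetical: DiagnosticsEntry, IngestRequest, planFlush, and the array request body are assumptions, and only the endpoint paths come from this document.

```typescript
type DiagnosticsKind = "logs" | "traces";

// Entry and request shapes are assumptions; only the paths are from this document.
interface DiagnosticsEntry {
  kind: DiagnosticsKind;
  payload: unknown;
}

interface IngestRequest {
  method: "POST";
  path: string;
  body: unknown[];
}

// Turn one mixed buffer into per-kind requests against the session-scoped endpoints.
function planFlush(sessionId: string, entries: DiagnosticsEntry[]): IngestRequest[] {
  if (sessionId.trim() === "") {
    throw new Error("sessionId is required for session-scoped ingest");
  }
  const base = `/api/diagnostics/sessions/${encodeURIComponent(sessionId)}`;
  const kinds: DiagnosticsKind[] = ["logs", "traces"];
  return kinds
    .map((kind) => ({
      method: "POST" as const,
      path: `${base}/${kind}`,
      body: entries.filter((e) => e.kind === kind).map((e) => e.payload),
    }))
    // Skip kinds with nothing buffered so no empty requests are sent.
    .filter((req) => req.body.length > 0);
}
```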

Suggested initial tool surface (minimal viable)

  • telemetry.queryEvents, telemetry.listClusters, telemetry.listPolicies, telemetry.previewPolicy, telemetry.createPolicy
  • diagnostics.createSession, diagnostics.getSession, diagnostics.getLogs, diagnostics.getTraces
  • extraction.extract, extraction.extractBatch, extraction.sidecarHealth
  • jobs.list, jobs.trigger