MCP Server Framework — Recommended Architecture (ByteLyst)
Why an MCP server here
This workspace already has a clear separation of concerns:
- Authoritative services (Fastify): platform-service, extraction-service, plus product backends.
- Dashboards (Next.js): admin + tracker.
- Client SDKs: Swift/Kotlin platform SDKs + TS client packages.
An MCP server becomes the single programmatic gateway that agents can call to:
- query/act on platform state
- assemble debugging evidence
- run repeatable ops workflows
- safely orchestrate A2A agents
Core design constraints
- Do not bypass service invariants
  - Prefer calling service endpoints or repositories with the same validation (Zod) and auth.
- Auditability
  - Every mutating tool should emit audit logs (or call APIs that already do).
- Least privilege
  - Split tools by role (viewer/admin/super_admin).
- Product isolation
  - All tools/resources must explicitly bind to productId.
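As a minimal sketch of the product-isolation constraint: the ToolCall envelope, its field names, and the bindProduct helper below are illustrative assumptions, not existing ByteLyst code. An MCP server could refuse any call that lacks a productId before forwarding it to a service, which would still apply its own validation and auth.

```typescript
// Hypothetical envelope for every tool call; field names are assumptions.
interface ToolCall {
  tool: string;
  productId: string;
  args: Record<string, unknown>;
}

// Returns a product-bound call or throws. This check is a front gate only;
// the backing service is still expected to run its own validation.
function bindProduct(raw: {
  tool: string;
  productId?: string;
  args?: Record<string, unknown>;
}): ToolCall {
  const productId = (raw.productId ?? "").trim();
  if (productId === "") {
    throw new Error(`tool "${raw.tool}" called without productId`);
  }
  return { tool: raw.tool, productId, args: raw.args ?? {} };
}
```

Rejecting at the envelope level keeps the rule in one place rather than repeating it in each tool handler.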
Reality check: what exists today
- platform-service already exposes:
  - GET /api/telemetry/config (ETag-based client collection config)
  - GET /api/telemetry/query, GET /api/telemetry/clusters, policies CRUD (admin)
  - diagnostics session CRUD + GET /api/diagnostics/sessions/:id/logs|traces|screenshots (admin)
- extraction-service already exposes:
  - /extract, /extract/batch, /extract/jobs, sidecar health, metrics, cache stats
The primary “new work” for MCP is orchestration, safety gating, and consistent auth/audit — not inventing new primitives.
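To illustrate that read-only tools can be thin wrappers over existing routes, here is a hedged sketch that maps tool names onto the telemetry endpoints listed above. The host, the telemetry.getConfig tool name, and the query parameter names are assumptions. Only the routes and port 4003 come from this document.

```typescript
// Assumed host; the document only states that platform-service uses port 4003.
const PLATFORM_BASE_URL = "http://localhost:4003";

// Tool names mapped onto routes that already exist in platform-service.
const READ_ROUTES: Record<string, string> = {
  "telemetry.queryEvents": "/api/telemetry/query",
  "telemetry.listClusters": "/api/telemetry/clusters",
  "telemetry.getConfig": "/api/telemetry/config", // hypothetical tool name
};

// Builds the GET URL for a read-only tool; query parameter names are illustrative.
function buildReadUrl(tool: string, params: Record<string, string | number>): string {
  const path = READ_ROUTES[tool];
  if (path === undefined) {
    throw new Error(`unknown read tool: ${tool}`);
  }
  const query = Object.keys(params)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(String(params[key]))}`)
    .join("&");
  return query === "" ? `${PLATFORM_BASE_URL}${path}` : `${PLATFORM_BASE_URL}${path}?${query}`;
}
```

A call such as buildReadUrl("telemetry.queryEvents", { productId: "demo", limit: 50 }) would yield http://localhost:4003/api/telemetry/query?productId=demo&limit=50 under these assumptions.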
Proposed MCP servers (2-tier)
1) bytelyst-platform-mcp (primary)
Backed by platform-service (port 4003) and optionally Cosmos for direct reads.
- Responsibilities
  - Telemetry querying + policy management
  - Remote diagnostics sessions orchestration
  - Jobs trigger/list
  - Flags/settings/maintenance
  - Webhooks + delivery logs
  - Audit query
2) bytelyst-extraction-mcp (specialized)
Backed by extraction-service (port 4005).
- Responsibilities
  - Extract / batch extract
  - Submit and monitor async extraction jobs
  - Sidecar health + circuit breaker insight
  - Metrics + cache stats
(Optionally, these can be a single MCP server with two namespaces.)
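If the single-server option is chosen, routing by tool-name prefix is one possible way to keep the two backends separate. This is a sketch under assumptions: the hosts are placeholders, and the rule that only "extraction." tools go to extraction-service is illustrative.

```typescript
// Ports come from this document; the hosts are placeholders.
const BACKENDS = {
  platform: "http://localhost:4003",
  extraction: "http://localhost:4005",
} as const;

type Backend = keyof typeof BACKENDS;

// Assumed rule: tools prefixed "extraction." go to extraction-service,
// every other namespace (telemetry, diagnostics, jobs, ...) goes to platform-service.
function backendFor(tool: string): Backend {
  return tool.indexOf("extraction.") === 0 ? "extraction" : "platform";
}

// Resolves the base URL a tool call should be forwarded to.
function baseUrlFor(tool: string): string {
  return BACKENDS[backendFor(tool)];
}
```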
Tool taxonomy
A) Read-only tools
- telemetry.queryEvents
- telemetry.listClusters
- telemetry.getMetrics
- diagnostics.getSession
- diagnostics.getLogs/getTraces
- jobs.list/listRuns
- flags.list
- settings.get
- webhooks.listSubscriptions/listDeliveries
- extraction.metrics/cacheStats/sidecarHealth
B) Mutating tools (require elevated role)
- telemetry.createPolicy/updatePolicy/deletePolicy
- telemetry.updateClusterStatus
- diagnostics.createSession/updateSession/cancelSession
- jobs.trigger
- maintenance.set
- flags.set (or flag upserts)
- webhooks.rotateSecret/webhooks.test
- extraction.rateLimitReset (if you keep that admin endpoint)
C) Compound tools (“one tool = one workflow”)
- support.createDebugPack(reportInput)
  - pulls telemetry timeline + cluster context
  - optionally starts diagnostics session
  - returns a single structured artifact (markdown/json)
This reduces prompt fragility vs. requiring the LLM to call 8 tools in the right order.
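The compound workflow could be sketched roughly as follows. Every type, field (including deviceId), and dependency name here is a hypothetical stand-in for the real service calls. The point is only that the ordering lives in code rather than in the prompt.

```typescript
// All shapes below are illustrative assumptions, not the platform's actual types.
interface ReportInput {
  productId: string;
  deviceId: string;
  startDiagnostics?: boolean;
}

// Injected service calls, so the workflow can be exercised with fakes.
interface DebugPackDeps {
  queryEvents(productId: string, deviceId: string): Promise<string[]>;
  listClusters(productId: string): Promise<string[]>;
  createSession(productId: string, deviceId: string): Promise<string>; // resolves to a session id
}

interface DebugPack {
  productId: string;
  timeline: string[];
  clusters: string[];
  diagnosticsSessionId: string | null;
}

async function createDebugPack(input: ReportInput, deps: DebugPackDeps): Promise<DebugPack> {
  // The two reads are independent, so they run concurrently.
  const [timeline, clusters] = await Promise.all([
    deps.queryEvents(input.productId, input.deviceId),
    deps.listClusters(input.productId),
  ]);
  // The mutating step is opt-in and happens only after the reads succeed.
  const diagnosticsSessionId = input.startDiagnostics
    ? await deps.createSession(input.productId, input.deviceId)
    : null;
  return { productId: input.productId, timeline, clusters, diagnosticsSessionId };
}
```

Injecting the dependencies also makes it straightforward to keep the compound tool on the same endpoints the individual tools use.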
MCP resources
Resources should be stable references agents can read repeatedly:
- platform.modules.index
  - module list + base routes + required headers
- telemetry.schema
- diagnostics.schema
- extraction.tasks.catalog
- ops.runbooks
  - e.g. “how to debug iOS keyboard insert_noop”
- product.identity
  - productId, plan tiers, allowed baseUrls
Prompts (MCP prompt templates)
- prompt.support_triage
- prompt.telemetry_policy_proposal
- prompt.remote_diagnostics_session_plan
- prompt.extraction_task_design
Authentication & authorization
- Primary: platform-service JWT (same verifyToken logic).
- Secondary: service-to-service API tokens (only for trusted automation).
- Tool gating
  - viewer: query-only
  - admin: policy updates, create diagnostics sessions
  - super_admin: secret rotation, maintenance, destructive operations
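The gating list above could be enforced with a small lookup like the one below. It assumes the roles are hierarchical (each role includes the permissions of the ones below it), which this document implies but does not state. The per-tool minimum roles are an illustrative subset.

```typescript
type Role = "viewer" | "admin" | "super_admin";

// Assumption: a higher rank includes everything granted to lower ranks.
const ROLE_RANK: Record<Role, number> = { viewer: 0, admin: 1, super_admin: 2 };

// Illustrative subset of tools and the minimum role each would require.
const MIN_ROLE: Record<string, Role> = {
  "telemetry.queryEvents": "viewer",
  "telemetry.createPolicy": "admin",
  "diagnostics.createSession": "admin",
  "webhooks.rotateSecret": "super_admin",
  "maintenance.set": "super_admin",
};

function canInvoke(role: Role, tool: string): boolean {
  const required = MIN_ROLE[tool];
  // Tools without a declared minimum role are denied by default.
  if (required === undefined) {
    return false;
  }
  return ROLE_RANK[role] >= ROLE_RANK[required];
}
```

Denying unregistered tools by default means a newly added mutating tool cannot be called until someone assigns it a role.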
Observability for the MCP server
- Use structured logs (Fastify/pino style) and propagate x-request-id.
- Record tool invocation metrics into telemetry as backend_service channel:
  - module: mcp
  - eventName: tool_invoked, tool_failed, a2a_handoff
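One way to record these events consistently is to wrap every tool handler. In this sketch the event shape and the emit callback are assumptions standing in for whatever telemetry client the server ends up using. It also assumes tool_invoked is recorded for calls that succeed and tool_failed for calls that throw.

```typescript
// Hypothetical event shape; only channel, module and eventName values come from the list above.
interface ToolMetricEvent {
  channel: "backend_service";
  module: "mcp";
  eventName: "tool_invoked" | "tool_failed";
  tool: string;
  requestId: string; // propagated x-request-id
  durationMs: number;
}

// Runs a tool handler and emits exactly one metric event for the call.
async function withToolMetrics<T>(
  tool: string,
  requestId: string,
  emit: (event: ToolMetricEvent) => void,
  run: () => Promise<T>,
): Promise<T> {
  const startedAt = Date.now();
  const base = { channel: "backend_service" as const, module: "mcp" as const, tool, requestId };
  try {
    const result = await run();
    emit({ ...base, eventName: "tool_invoked", durationMs: Date.now() - startedAt });
    return result;
  } catch (error) {
    emit({ ...base, eventName: "tool_failed", durationMs: Date.now() - startedAt });
    throw error; // the caller still sees the original failure
  }
}
```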
Safe defaults / guardrails
- Any mutating tool should support a dryRun: true mode.
- Enforce expiresAt on any “diagnostic collection amplification” (telemetry policy, diagnostics session).
- Cap queries by default (limit/pageSize), require explicit limit increases.
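The three guardrails above might look like this in code. The numeric defaults (100, 1000, 24 hours) and all helper names are placeholders chosen for illustration, not values taken from the platform.

```typescript
// Placeholder values; the real caps would be a platform decision.
const DEFAULT_LIMIT = 100;
const MAX_LIMIT = 1000;
const MAX_TTL_MS = 24 * 60 * 60 * 1000;

// A missing limit falls back to the default; explicit limits are clamped to [1, MAX_LIMIT].
function resolveLimit(requested?: number): number {
  if (requested === undefined || isNaN(requested)) {
    return DEFAULT_LIMIT;
  }
  return Math.min(Math.max(1, Math.floor(requested)), MAX_LIMIT);
}

// Rejects an expiry that is missing, unparseable, in the past, or beyond the maximum TTL.
function assertExpiresAt(expiresAt: string | undefined, now: Date): Date {
  if (expiresAt === undefined) {
    throw new Error("expiresAt is required");
  }
  const expiry = new Date(expiresAt);
  const ttlMs = expiry.getTime() - now.getTime();
  if (isNaN(ttlMs) || ttlMs <= 0 || ttlMs > MAX_TTL_MS) {
    throw new Error("expiresAt must be in the future and within the maximum TTL");
  }
  return expiry;
}

// With dryRun set, describe the planned change and skip the side effect.
function applyMutation<T>(
  dryRun: boolean,
  description: string,
  execute: () => T,
): { applied: boolean; description: string; result?: T } {
  if (dryRun) {
    return { applied: false, description };
  }
  return { applied: true, description, result: execute() };
}
```

Passing now explicitly keeps the expiry check deterministic in tests.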
Known integration risk (fix early)
- @bytelyst/diagnostics-client currently flushes to POST /api/diagnostics/ingest, while platform-service ingests via session-scoped endpoints.
- Resolve this mismatch before using diagnostics tooling as a core MCP/A2A workflow dependency.
  - Decision: update @bytelyst/diagnostics-client to flush to POST /api/diagnostics/sessions/:id/logs|traces.
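On the client side, the decision amounts to building a session-scoped path per payload kind instead of one shared ingest path. A minimal sketch, with a hypothetical helper name and no claim about how the client is actually structured:

```typescript
type DiagnosticsKind = "logs" | "traces";

// Builds the session-scoped ingest path named in the decision above.
function sessionIngestPath(sessionId: string, kind: DiagnosticsKind): string {
  if (sessionId.trim() === "") {
    throw new Error("sessionId is required for session-scoped ingest");
  }
  return `/api/diagnostics/sessions/${encodeURIComponent(sessionId)}/${kind}`;
}
```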
Suggested initial tool surface (minimal viable)
- telemetry.queryEvents, telemetry.listClusters, telemetry.listPolicies, telemetry.previewPolicy, telemetry.createPolicy
- diagnostics.createSession, diagnostics.getSession, diagnostics.getLogs, diagnostics.getTraces
- extraction.extract, extraction.extractBatch, extraction.sidecarHealth
- jobs.list, jobs.trigger