# MCP Server Framework: Recommended Architecture (ByteLyst)

## Why an MCP server here

This workspace already has a clear separation of concerns:

- **Authoritative services** (Fastify): `platform-service`, `extraction-service`, plus product backends.
- **Dashboards** (Next.js): admin + tracker.
- **Client SDKs**: Swift/Kotlin platform SDKs + TS client packages.

An MCP server becomes the **single programmatic gateway** that agents can call to:

- query/act on platform state
- assemble debugging evidence
- run repeatable ops workflows
- safely orchestrate A2A agents

## Core design constraints

- **Do not bypass service invariants**
  - Prefer calling service endpoints or repositories with the same validation (Zod) and auth.
- **Auditability**
  - Every mutating tool should emit audit logs (or call APIs that already do).
- **Least privilege**
  - Split tools by role (viewer/admin/super_admin).
- **Product isolation**
  - All tools/resources must explicitly bind to `productId`.

## Reality check: what exists today

- `platform-service` already exposes:
  - `GET /api/telemetry/config` (ETag-based client collection config)
  - `GET /api/telemetry/query`, `GET /api/telemetry/clusters`, policies CRUD (admin)
  - diagnostics session CRUD + `GET /api/diagnostics/sessions/:id/logs|traces|screenshots` (admin)
- `extraction-service` already exposes:
  - `/extract`, `/extract/batch`, `/extract/jobs`, sidecar health, metrics, cache stats

The primary “new work” for MCP is orchestration, safety gating, and consistent auth/audit, not inventing new primitives.

## Proposed MCP servers (2-tier)

### 1) `bytelyst-platform-mcp` (primary)

Backed by `platform-service` (port 4003) and optionally Cosmos for direct reads.

- **Responsibilities**
  - Telemetry querying + policy management
  - Remote diagnostics sessions orchestration
  - Jobs trigger/list
  - Flags/settings/maintenance
  - Webhooks + delivery logs
  - Audit query

### 2) `bytelyst-extraction-mcp` (specialized)

Backed by `extraction-service` (port 4005).

- **Responsibilities**
  - Extract / batch extract
  - Submit and monitor async extraction jobs
  - Sidecar health + circuit breaker insight
  - Metrics + cache stats

(Optionally, these can be a single MCP server with two namespaces.)

## Tool taxonomy

### A) Read-only tools

- `telemetry.queryEvents`
- `telemetry.listClusters`
- `telemetry.getMetrics`
- `diagnostics.getSession`
- `diagnostics.getLogs/getTraces`
- `jobs.list/listRuns`
- `flags.list`
- `settings.get`
- `webhooks.listSubscriptions/listDeliveries`
- `extraction.metrics/cacheStats/sidecarHealth`

### B) Mutating tools (require elevated role)

- `telemetry.createPolicy/updatePolicy/deletePolicy`
- `telemetry.updateClusterStatus`
- `diagnostics.createSession/updateSession/cancelSession`
- `jobs.trigger`
- `maintenance.set`
- `flags.set` (or flag upserts)
- `webhooks.rotateSecret` / `webhooks.test`
- `extraction.rateLimitReset` (if you keep that admin endpoint)

### C) Compound tools (“one tool = one workflow”)

- `support.createDebugPack(reportInput)`
  - pulls telemetry timeline + cluster context
  - optionally starts diagnostics session
  - returns a single structured artifact (markdown/json)

This reduces prompt fragility vs. requiring the LLM to call 8 tools in the right order.

## MCP resources

Resources should be stable references agents can read repeatedly:

- `platform.modules.index`
  - module list + base routes + required headers
- `telemetry.schema`
- `diagnostics.schema`
- `extraction.tasks.catalog`
- `ops.runbooks`
  - e.g. “how to debug iOS keyboard insert_noop”
- `product.identity`
  - productId, plan tiers, allowed baseUrls

## Prompts (MCP prompt templates)

- `prompt.support_triage`
- `prompt.telemetry_policy_proposal`
- `prompt.remote_diagnostics_session_plan`
- `prompt.extraction_task_design`

## Authentication & authorization

- **Primary**: platform-service JWT (same `verifyToken` logic).
- **Secondary**: service-to-service API tokens (only for trusted automation).
- **Tool gating**
  - viewer: query-only
  - admin: policy updates, create diagnostics sessions
  - super_admin: secret rotation, maintenance, destructive operations

## Observability for the MCP server

- Use structured logs (Fastify/pino style) and propagate `x-request-id`.
- Record tool invocation metrics into `telemetry` as `backend_service` channel:
  - module: `mcp`
  - eventName: `tool_invoked`, `tool_failed`, `a2a_handoff`

## Safe defaults / guardrails

- Any mutating tool should support a `dryRun: true` mode.
- Enforce `expiresAt` on any “diagnostic collection amplification” (telemetry policy, diagnostics session).
- Cap queries by default (limit/pageSize), require explicit `limit` increases.

## Known integration risk (fix early)

- `@bytelyst/diagnostics-client` currently flushes to `POST /api/diagnostics/ingest`, while `platform-service` ingests via session-scoped endpoints.
  - Resolve this mismatch before using diagnostics tooling as a core MCP/A2A workflow dependency.
  - Decision: update `@bytelyst/diagnostics-client` to post to `POST /api/diagnostics/sessions/:id/logs|traces`.

## Suggested initial tool surface (minimal viable)

- `telemetry.queryEvents`, `telemetry.listClusters`, `telemetry.listPolicies`, `telemetry.previewPolicy`, `telemetry.createPolicy`
- `diagnostics.createSession`, `diagnostics.getSession`, `diagnostics.getLogs`, `diagnostics.getTraces`
- `extraction.extract`, `extraction.extractBatch`, `extraction.sidecarHealth`
- `jobs.list`, `jobs.trigger`
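
## Appendix: sketch of tool gating and guardrails

The role gating (viewer/admin/super_admin), mandatory `productId` binding, `dryRun` mode, and default query caps described in this document could be wired together roughly as follows. This is a minimal sketch, not the actual implementation: `ToolSpec`, `ToolCall`, `invokeTool`, `DEFAULT_LIMIT`, and `MAX_LIMIT` are hypothetical names, the limit values are assumed, and the `executed` branch stands in for a real call to `platform-service` or `extraction-service`.

```typescript
// Illustrative sketch only: ToolSpec, ToolCall, invokeTool, DEFAULT_LIMIT and
// MAX_LIMIT are hypothetical names, not part of any existing ByteLyst package.

type Role = "viewer" | "admin" | "super_admin";

// Higher rank means more privilege.
const ROLE_RANK: Record<Role, number> = { viewer: 0, admin: 1, super_admin: 2 };

interface ToolSpec {
  minRole: Role;     // lowest role allowed to call the tool
  mutating: boolean; // mutating tools honor dryRun
}

interface ToolCall {
  productId: string; // every call must bind to a product
  role: Role;
  dryRun?: boolean;
  limit?: number;
}

type ToolResult =
  | { status: "denied"; reason: string }
  | { status: "dry_run"; tool: string; productId: string }
  | { status: "executed"; tool: string; productId: string; limit: number };

const DEFAULT_LIMIT = 50; // assumed default page size
const MAX_LIMIT = 500;    // assumed hard ceiling for explicit increases

// A small subset of the tool taxonomy, mapped to the gating tiers.
const TOOLS = new Map<string, ToolSpec>([
  ["telemetry.queryEvents", { minRole: "viewer", mutating: false }],
  ["telemetry.createPolicy", { minRole: "admin", mutating: true }],
  ["webhooks.rotateSecret", { minRole: "super_admin", mutating: true }],
]);

function invokeTool(toolName: string, call: ToolCall): ToolResult {
  const spec = TOOLS.get(toolName);
  if (spec === undefined) {
    return { status: "denied", reason: `unknown tool: ${toolName}` };
  }
  // Product isolation: refuse calls that do not name a product.
  if (call.productId.trim() === "") {
    return { status: "denied", reason: "productId is required" };
  }
  // Least privilege: compare the caller's rank with the tool's minimum.
  if (ROLE_RANK[call.role] < ROLE_RANK[spec.minRole]) {
    return { status: "denied", reason: `${toolName} requires ${spec.minRole}` };
  }
  // dryRun: report what would happen without touching the backend.
  if (spec.mutating && call.dryRun === true) {
    return { status: "dry_run", tool: toolName, productId: call.productId };
  }
  // Query cap: default when omitted, clamp explicit values to [1, MAX_LIMIT].
  const requested = call.limit ?? DEFAULT_LIMIT;
  const limit = Math.max(1, Math.min(requested, MAX_LIMIT));
  // Placeholder for the real backend call and audit log emission.
  return { status: "executed", tool: toolName, productId: call.productId, limit };
}
```

Under these assumptions, a `viewer` calling `telemetry.createPolicy` is denied, while an `admin` calling it with `dryRun: true` gets a `dry_run` result without reaching the backend. Keeping the checks in one dispatcher, rather than inside each tool, is one way to keep gating and audit behavior consistent across both MCP servers.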