# MCP Server Framework — Recommended Architecture (ByteLyst)

## Why an MCP server here

This workspace already has a clear separation of concerns:

- **Authoritative services** (Fastify): `platform-service`, `extraction-service`, plus product backends.
- **Dashboards** (Next.js): admin + tracker.
- **Client SDKs**: Swift/Kotlin platform SDKs + TS client packages.

An MCP server becomes the **single programmatic gateway** that agents can call to:

- query/act on platform state
- assemble debugging evidence
- run repeatable ops workflows
- safely orchestrate agent-to-agent (A2A) handoffs

## Core design constraints

- **Do not bypass service invariants**
  - Prefer calling service endpoints, or repositories that apply the same validation (Zod) and auth checks, over writing to storage directly.
- **Auditability**
  - Every mutating tool should emit audit logs (or call APIs that already do).
- **Least privilege**
  - Split tools by role (viewer/admin/super_admin).
- **Product isolation**
  - All tools/resources must explicitly bind to `productId`.
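
As a minimal sketch of the product-isolation constraint, a tool handler could reject any input that lacks an explicit `productId` before it calls a service. The names `ProductScopedInput`, `parseProductScopedInput`, and `assertProductAccess` are hypothetical, and the hand-written checks only stand in for the Zod schemas the services already use.

```typescript
// Hypothetical input shape: every tool call names the product it targets.
interface ProductScopedInput {
  productId: string;
  [key: string]: unknown;
}

// Stand-in for Zod validation: accept only objects with a non-empty productId.
function parseProductScopedInput(raw: unknown): ProductScopedInput {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("tool input must be an object");
  }
  const input = raw as Record<string, unknown>;
  const productId = input["productId"];
  if (typeof productId !== "string" || productId.trim() === "") {
    throw new Error("productId is required and must be a non-empty string");
  }
  return { ...input, productId };
}

// Assumption: the caller's token lists the products it is allowed to act on.
function assertProductAccess(
  allowedProductIds: readonly string[],
  productId: string,
): void {
  if (!allowedProductIds.includes(productId)) {
    throw new Error(`caller is not allowed to access product "${productId}"`);
  }
}
```

Running both checks at the top of every handler keeps the binding explicit instead of relying on a default product.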

## Reality check: what exists today

- `platform-service` already exposes:
  - `GET /api/telemetry/config` (ETag-based client collection config)
  - `GET /api/telemetry/query`, `GET /api/telemetry/clusters`, policies CRUD (admin)
  - diagnostics session CRUD + `GET /api/diagnostics/sessions/:id/logs|traces|screenshots` (admin)
- `extraction-service` already exposes:
  - `/extract`, `/extract/batch`, `/extract/jobs`, sidecar health, metrics, cache stats

The primary “new work” for MCP is therefore orchestration, safety gating, and consistent auth/audit, not inventing new primitives.

## Proposed MCP servers (2-tier)

### 1) `bytelyst-platform-mcp` (primary)

Backed by `platform-service` (port 4003) and, optionally, Cosmos for direct reads. Keep any direct Cosmos access read-only so that service invariants are not bypassed.

- **Responsibilities**
  - Telemetry querying + policy management
  - Remote diagnostics sessions orchestration
  - Jobs trigger/list
  - Flags/settings/maintenance
  - Webhooks + delivery logs
  - Audit query

### 2) `bytelyst-extraction-mcp` (specialized)

Backed by `extraction-service` (port 4005).

- **Responsibilities**
  - Extract / batch extract
  - Submit and monitor async extraction jobs
  - Sidecar health + circuit breaker insight
  - Metrics + cache stats

(Optionally, these can be a single MCP server with two namespaces.)
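
If the single-server option is chosen, routing by namespace could look roughly like the sketch below. The ports are the ones stated above; the `localhost` host, the namespace table, and the `resolveBackend` helper are assumptions made for illustration.

```typescript
type Backend = "platform-service" | "extraction-service";

// Ports come from this document; the localhost host is an assumption.
const BACKEND_BASE_URLS: Record<Backend, string> = {
  "platform-service": "http://localhost:4003",
  "extraction-service": "http://localhost:4005",
};

// Hypothetical table: the prefix before the first "." selects the backend.
const NAMESPACE_TO_BACKEND = new Map<string, Backend>([
  ["telemetry", "platform-service"],
  ["diagnostics", "platform-service"],
  ["jobs", "platform-service"],
  ["flags", "platform-service"],
  ["settings", "platform-service"],
  ["maintenance", "platform-service"],
  ["webhooks", "platform-service"],
  ["extraction", "extraction-service"],
]);

function resolveBackend(toolName: string): { backend: Backend; baseUrl: string } {
  const dot = toolName.indexOf(".");
  if (dot <= 0) {
    throw new Error(`tool name "${toolName}" has no namespace`);
  }
  const namespace = toolName.slice(0, dot);
  const backend = NAMESPACE_TO_BACKEND.get(namespace);
  if (backend === undefined) {
    throw new Error(`unknown namespace "${namespace}"`);
  }
  return { backend, baseUrl: BACKEND_BASE_URLS[backend] };
}
```

Unknown namespaces fail loudly rather than falling back to a default backend, which keeps a misnamed tool from reaching the wrong service.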

## Tool taxonomy

### A) Read-only tools

- `telemetry.queryEvents`
- `telemetry.listClusters`
- `telemetry.getMetrics`
- `telemetry.listPolicies`
- `diagnostics.getSession`
- `diagnostics.getLogs/getTraces`
- `jobs.list/listRuns`
- `flags.list`
- `settings.get`
- `webhooks.listSubscriptions/listDeliveries`
- `extraction.metrics/cacheStats/sidecarHealth`

### B) Mutating tools (require elevated role)

- `telemetry.createPolicy/updatePolicy/deletePolicy`
- `telemetry.updateClusterStatus`
- `diagnostics.createSession/updateSession/cancelSession`
- `jobs.trigger`
- `maintenance.set`
- `flags.set` (or flag upserts)
- `webhooks.rotateSecret` / `webhooks.test`
- `extraction.rateLimitReset` (if you keep that admin endpoint)

### C) Compound tools (“one tool = one workflow”)

- `support.createDebugPack(reportInput)`
  - pulls telemetry timeline + cluster context
  - optionally starts diagnostics session
  - returns a single structured artifact (markdown/json)

A compound tool reduces prompt fragility: the LLM makes one call instead of having to invoke several tools (eight, for example) in the right order.
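
The workflow could be sketched as a single function that takes the underlying tools as injected dependencies. Everything named here (`DebugPackDeps`, `DebugPackInput`, the field names, the 30-minute session lifetime, and the event limit of 100) is an assumption for illustration, not the existing API.

```typescript
// Hypothetical dependencies: thin wrappers around the underlying tools.
interface DebugPackDeps {
  queryEvents(productId: string, limit: number): Promise<unknown[]>;
  listClusters(productId: string): Promise<unknown[]>;
  createSession(productId: string, expiresAt: string): Promise<{ id: string }>;
}

// Hypothetical report input; the real `reportInput` shape is not specified here.
interface DebugPackInput {
  productId: string;
  summary: string;
  startDiagnostics?: boolean;
}

interface DebugPack {
  productId: string;
  summary: string;
  timeline: unknown[];
  clusters: unknown[];
  diagnosticsSessionId: string | null;
}

const SESSION_TTL_MS = 30 * 60 * 1000; // assumed default lifetime
const TIMELINE_LIMIT = 100; // assumed cap on events pulled into the pack

async function createDebugPack(
  deps: DebugPackDeps,
  input: DebugPackInput,
): Promise<DebugPack> {
  // The two reads are independent, so they run in parallel.
  const [timeline, clusters] = await Promise.all([
    deps.queryEvents(input.productId, TIMELINE_LIMIT),
    deps.listClusters(input.productId),
  ]);

  // The diagnostics session is opt-in and always time-boxed.
  let diagnosticsSessionId: string | null = null;
  if (input.startDiagnostics === true) {
    const expiresAt = new Date(Date.now() + SESSION_TTL_MS).toISOString();
    const session = await deps.createSession(input.productId, expiresAt);
    diagnosticsSessionId = session.id;
  }

  return {
    productId: input.productId,
    summary: input.summary,
    timeline,
    clusters,
    diagnosticsSessionId,
  };
}
```

Injecting the dependencies keeps the ordering logic in one place and lets the workflow be tested without a running `platform-service`.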

## MCP resources

Resources should be stable references agents can read repeatedly:

- `platform.modules.index`
  - module list + base routes + required headers
- `telemetry.schema`
- `diagnostics.schema`
- `extraction.tasks.catalog`
- `ops.runbooks`
  - e.g. “how to debug iOS keyboard insert_noop”
- `product.identity`
  - productId, plan tiers, allowed baseUrls

## Prompts (MCP prompt templates)

- `prompt.support_triage`
- `prompt.telemetry_policy_proposal`
- `prompt.remote_diagnostics_session_plan`
- `prompt.extraction_task_design`

## Authentication & authorization

- **Primary**: `platform-service` JWT, verified with the same `verifyToken` logic.
- **Secondary**: service-to-service API tokens (only for trusted automation).
- **Tool gating**
  - viewer: query-only
  - admin: policy updates, diagnostics session creation
  - super_admin: secret rotation, maintenance, destructive operations
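
One way to express the gating above is a minimum-role table checked before a tool runs. This sketch assumes the roles are strictly hierarchical (each role includes the permissions of the one below it); the table entries and the `canInvoke` helper are illustrative, not an existing API.

```typescript
type Role = "viewer" | "admin" | "super_admin";

// Assumption: roles are ordered, so a higher rank implies the lower ones.
const ROLE_RANK: Record<Role, number> = { viewer: 0, admin: 1, super_admin: 2 };

// Illustrative subset of tools and the minimum role each one requires.
const TOOL_MIN_ROLE = new Map<string, Role>([
  ["telemetry.queryEvents", "viewer"],
  ["diagnostics.getSession", "viewer"],
  ["telemetry.createPolicy", "admin"],
  ["diagnostics.createSession", "admin"],
  ["webhooks.rotateSecret", "super_admin"],
  ["maintenance.set", "super_admin"],
]);

function canInvoke(role: Role, toolName: string): boolean {
  const required = TOOL_MIN_ROLE.get(toolName);
  if (required === undefined) {
    return false; // tools missing from the table are denied by default
  }
  return ROLE_RANK[role] >= ROLE_RANK[required];
}
```

Denying unlisted tools by default means a newly added mutating tool stays unreachable until someone assigns it a role.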

## Observability for the MCP server

- Use structured logs (Fastify/pino style) and propagate `x-request-id`.
- Record tool-invocation metrics in `telemetry` on the `backend_service` channel:
  - module: `mcp`
  - eventName: `tool_invoked`, `tool_failed`, `a2a_handoff`
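
A wrapper along these lines could emit the events for every tool call. The channel, module, and event names are the ones listed above; the remaining fields (`requestId`, `toolName`, `durationMs`, `errorMessage`), the `emit` callback, and the choice to emit `tool_invoked` only after a successful run are assumptions.

```typescript
type McpEventName = "tool_invoked" | "tool_failed" | "a2a_handoff";

interface McpTelemetryEvent {
  channel: "backend_service";
  module: "mcp";
  eventName: McpEventName;
  requestId: string; // same value that is propagated as x-request-id
  toolName: string;
  durationMs: number;
  errorMessage?: string;
}

function buildToolEvent(
  eventName: McpEventName,
  requestId: string,
  toolName: string,
  durationMs: number,
  errorMessage?: string,
): McpTelemetryEvent {
  const event: McpTelemetryEvent = {
    channel: "backend_service",
    module: "mcp",
    eventName,
    requestId,
    toolName,
    durationMs,
  };
  if (errorMessage !== undefined) {
    event.errorMessage = errorMessage;
  }
  return event;
}

// Runs a tool, emits exactly one event, and rethrows any failure unchanged.
async function invokeWithTelemetry<T>(
  requestId: string,
  toolName: string,
  run: () => Promise<T>,
  emit: (event: McpTelemetryEvent) => void,
): Promise<T> {
  const startedAt = Date.now();
  try {
    const result = await run();
    emit(buildToolEvent("tool_invoked", requestId, toolName, Date.now() - startedAt));
    return result;
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    emit(buildToolEvent("tool_failed", requestId, toolName, Date.now() - startedAt, message));
    throw error;
  }
}
```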

## Safe defaults / guardrails

- Any mutating tool should support a `dryRun: true` mode.
- Require `expiresAt` on anything that amplifies diagnostic collection (telemetry policies, diagnostics sessions), so that elevated collection ends on its own.
- Cap query results by default (limit/pageSize); callers must pass an explicit `limit` to get more.
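
The three guardrails could be sketched as small helpers that shared tool code calls before doing any work. The specific numbers (default limit 50, hard cap 500, maximum lifetime 24 hours) and all helper names are assumptions chosen for illustration.

```typescript
const DEFAULT_LIMIT = 50; // assumed default page size
const MAX_LIMIT = 500; // assumed hard ceiling, even for explicit requests
const MAX_TTL_MS = 24 * 60 * 60 * 1000; // assumed longest allowed lifetime

// Default to a small page; honor explicit limits only up to the ceiling.
function resolveLimit(requested?: number): number {
  if (requested === undefined) {
    return DEFAULT_LIMIT;
  }
  if (!Number.isInteger(requested) || requested < 1) {
    throw new Error("limit must be a positive integer");
  }
  return Math.min(requested, MAX_LIMIT);
}

// Reject missing, malformed, past, or overly distant expiry timestamps.
function validateExpiresAt(expiresAt: string | undefined, now: Date = new Date()): Date {
  if (expiresAt === undefined) {
    throw new Error("expiresAt is required");
  }
  const parsed = new Date(expiresAt);
  if (Number.isNaN(parsed.getTime())) {
    throw new Error("expiresAt must be a valid date string");
  }
  const ttlMs = parsed.getTime() - now.getTime();
  if (ttlMs <= 0) {
    throw new Error("expiresAt must be in the future");
  }
  if (ttlMs > MAX_TTL_MS) {
    throw new Error("expiresAt is too far in the future");
  }
  return parsed;
}

interface MutationResult<T> {
  dryRun: boolean;
  applied: boolean;
  plan: T;
}

// With dryRun set, return the plan without calling the side-effecting step.
function runMutation<T>(plan: T, dryRun: boolean, apply: (plan: T) => void): MutationResult<T> {
  if (!dryRun) {
    apply(plan);
  }
  return { dryRun, applied: !dryRun, plan };
}
```

Returning the plan in both modes lets an agent show the intended change to a human before repeating the call with `dryRun: false`.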

## Known integration risk (fix early)

- `@bytelyst/diagnostics-client` currently flushes to `POST /api/diagnostics/ingest`, while `platform-service` ingests via session-scoped endpoints.
- Resolve this mismatch before using diagnostics tooling as a core MCP/A2A workflow dependency.
  - Decision: update `@bytelyst/diagnostics-client` to send to the session-scoped endpoints, `POST /api/diagnostics/sessions/:id/logs` and `POST /api/diagnostics/sessions/:id/traces`.
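
On the client side, the change could amount to splitting the flush buffer by kind and targeting the session-scoped paths. This is only a sketch: `BufferedEntry`, `FlushRequest`, and `planFlush` are hypothetical names, and the real client's buffer and payload shapes may differ.

```typescript
type DiagnosticsKind = "logs" | "traces";

// Hypothetical buffered entry; the real client payload may differ.
interface BufferedEntry {
  kind: DiagnosticsKind;
  payload: unknown;
}

interface FlushRequest {
  method: "POST";
  path: string;
  entries: unknown[];
}

function sessionIngestPath(sessionId: string, kind: DiagnosticsKind): string {
  if (sessionId.trim() === "") {
    throw new Error("sessionId is required for session-scoped ingestion");
  }
  return `/api/diagnostics/sessions/${encodeURIComponent(sessionId)}/${kind}`;
}

// Split one mixed buffer into at most one request per session-scoped endpoint.
function planFlush(sessionId: string, buffer: readonly BufferedEntry[]): FlushRequest[] {
  const kinds: DiagnosticsKind[] = ["logs", "traces"];
  const requests: FlushRequest[] = [];
  for (const kind of kinds) {
    const entries = buffer
      .filter((entry) => entry.kind === kind)
      .map((entry) => entry.payload);
    if (entries.length > 0) {
      requests.push({ method: "POST", path: sessionIngestPath(sessionId, kind), entries });
    }
  }
  return requests;
}
```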

## Suggested initial tool surface (minimal viable)

- `telemetry.queryEvents`, `telemetry.listClusters`, `telemetry.listPolicies`, `telemetry.previewPolicy`, `telemetry.createPolicy`
- `diagnostics.createSession`, `diagnostics.getSession`, `diagnostics.getLogs`, `diagnostics.getTraces`
- `extraction.extract`, `extraction.extractBatch`, `extraction.sidecarHealth`
- `jobs.list`, `jobs.trigger`