bytelyst/learning_ai_common_plat

saravanakumardb1 bf7769bdaa fix(diagnostics-client): use session-scoped ingest endpoints; update MCP+A2A docs

2026-03-05 10:41:02 -08:00

MCP Server Framework — Recommended Architecture (ByteLyst)

Why an MCP server here

This workspace already has a clear separation of concerns:

Authoritative services (Fastify): platform-service, extraction-service, plus product backends.
Dashboards (Next.js): admin + tracker.
Client SDKs: Swift/Kotlin platform SDKs + TS client packages.

An MCP server becomes the single programmatic gateway that agents can call to:

query/act on platform state
assemble debugging evidence
run repeatable ops workflows
safely orchestrate A2A agents

Core design constraints

Do not bypass service invariants
- Prefer calling service endpoints or repositories with the same validation (Zod) and auth.
Auditability
- Every mutating tool should emit audit logs (or call APIs that already do).
Least privilege
- Split tools by role (viewer/admin/super_admin).
Product isolation
- All tools/resources must explicitly bind to productId.
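
As an illustration of the product-isolation constraint, the sketch below rejects any tool call that does not name a product the caller is scoped to. The function name `assertProductScope` and the `allowedProductIds` parameter are hypothetical; in practice the allowed set would presumably come from the verified token.

```typescript
// Hypothetical helper: every tool handler calls this before doing any work.
// `allowedProductIds` is assumed to be derived from the caller's verified token.
function assertProductScope(
  input: { productId?: unknown },
  allowedProductIds: readonly string[]
): string {
  const productId = input.productId;
  if (typeof productId !== "string" || productId.length === 0) {
    throw new Error("productId is required on every tool call");
  }
  if (allowedProductIds.indexOf(productId) === -1) {
    throw new Error('caller is not scoped to product "' + productId + '"');
  }
  return productId;
}
```

Returning the validated `productId` lets handlers use the checked value instead of re-reading raw input.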

Reality check: what exists today

platform-service already exposes:
- GET /api/telemetry/config (ETag-based client collection config)
- GET /api/telemetry/query, GET /api/telemetry/clusters, policies CRUD (admin)
- diagnostics session CRUD + GET /api/diagnostics/sessions/:id/logs|traces|screenshots (admin)
extraction-service already exposes:
- /extract, /extract/batch, /extract/jobs, sidecar health, metrics, cache stats

The primary “new work” for MCP is orchestration, safety gating, and consistent auth/audit — not inventing new primitives.

Proposed MCP servers (2-tier)

1) `bytelyst-platform-mcp` (primary)

Backed by platform-service (port 4003) and optionally Cosmos for direct reads.

Responsibilities
- Telemetry querying + policy management
- Remote diagnostics sessions orchestration
- Jobs trigger/list
- Flags/settings/maintenance
- Webhooks + delivery logs
- Audit query

2) `bytelyst-extraction-mcp` (specialized)

Backed by extraction-service (port 4005).

Responsibilities
- Extract / batch extract
- Submit and monitor async extraction jobs
- Sidecar health + circuit breaker insight
- Metrics + cache stats

(Optionally, these can be a single MCP server with two namespaces.)
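
If the single-server option is chosen, routing could be keyed on the tool namespace (the part of the tool name before the first dot). The sketch below is one possible shape: the ports are the ones stated above, while the `localhost` hostnames, the table names, and `resolveBackend` are assumptions made for illustration.

```typescript
// Ports are taken from this document; the hostnames are placeholders.
const BACKENDS: { [service: string]: string } = {
  platform: "http://localhost:4003",
  extraction: "http://localhost:4005",
};

// Hypothetical mapping from tool namespace to backing service.
const NAMESPACE_TO_SERVICE: { [ns: string]: string } = {
  telemetry: "platform",
  diagnostics: "platform",
  jobs: "platform",
  flags: "platform",
  settings: "platform",
  maintenance: "platform",
  webhooks: "platform",
  extraction: "extraction",
};

function hasOwn(table: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(table, key);
}

// Resolves "telemetry.queryEvents" to the base URL of the service that owns it.
function resolveBackend(toolName: string): string {
  const dot = toolName.indexOf(".");
  if (dot <= 0) {
    throw new Error("tool name must look like <namespace>.<tool>");
  }
  const ns = toolName.slice(0, dot);
  if (!hasOwn(NAMESPACE_TO_SERVICE, ns)) {
    throw new Error('unknown tool namespace "' + ns + '"');
  }
  const service = NAMESPACE_TO_SERVICE[ns] as string;
  if (!hasOwn(BACKENDS, service)) {
    throw new Error('no backend configured for "' + service + '"');
  }
  return BACKENDS[service] as string;
}
```

Unknown namespaces fail closed, so a mistyped tool name never falls through to a default backend.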

Tool taxonomy

A) Read-only tools

telemetry.queryEvents
telemetry.listClusters
telemetry.getMetrics
diagnostics.getSession
diagnostics.getLogs/getTraces
jobs.list/listRuns
flags.list
settings.get
webhooks.listSubscriptions/listDeliveries
extraction.metrics/cacheStats/sidecarHealth

B) Mutating tools (require elevated role)

telemetry.createPolicy/updatePolicy/deletePolicy
telemetry.updateClusterStatus
diagnostics.createSession/updateSession/cancelSession
jobs.trigger
maintenance.set
flags.set (or flag upserts)
webhooks.rotateSecret / webhooks.test
extraction.rateLimitReset (if you keep that admin endpoint)

C) Compound tools (“one tool = one workflow”)

support.createDebugPack(reportInput)
- pulls telemetry timeline + cluster context
- optionally starts diagnostics session
- returns a single structured artifact (markdown/json)

This reduces prompt fragility compared with requiring the LLM to call 8 tools in the right order.
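
A compound tool along these lines might be sketched as follows. Everything here is hypothetical: the input fields (`userId`, `startDiagnostics`), the `DebugPackDeps` interface, and the output shape are assumptions, and the dependencies are synchronous only to keep the sketch short; real clients would be asynchronous.

```typescript
// Hypothetical input; the document only names the parameter `reportInput`.
interface DebugPackInput {
  productId: string;
  userId: string;
  startDiagnostics?: boolean;
}

// Injected clients standing in for the underlying single-purpose tools.
interface DebugPackDeps {
  queryEvents(productId: string, userId: string): string[];
  listClusters(productId: string): string[];
  createSession(productId: string, userId: string): string; // returns a session id
}

interface DebugPack {
  productId: string;
  timeline: string[];
  clusters: string[];
  diagnosticsSessionId: string | null;
  markdown: string;
}

// Runs the whole workflow in a fixed order and returns one structured artifact.
function createDebugPack(input: DebugPackInput, deps: DebugPackDeps): DebugPack {
  const timeline = deps.queryEvents(input.productId, input.userId);
  const clusters = deps.listClusters(input.productId);
  const diagnosticsSessionId =
    input.startDiagnostics === true
      ? deps.createSession(input.productId, input.userId)
      : null;

  const lines: string[] = ["# Debug pack: " + input.userId, "", "## Telemetry timeline"];
  for (const event of timeline) lines.push("- " + event);
  lines.push("", "## Clusters");
  for (const cluster of clusters) lines.push("- " + cluster);
  lines.push("", "Diagnostics session: " + (diagnosticsSessionId === null ? "not started" : diagnosticsSessionId));

  return {
    productId: input.productId,
    timeline: timeline,
    clusters: clusters,
    diagnosticsSessionId: diagnosticsSessionId,
    markdown: lines.join("\n"),
  };
}
```

Because the ordering lives in code, the agent makes one call and cannot skip or reorder steps.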

MCP resources

Resources should be stable references agents can read repeatedly:

platform.modules.index
- module list + base routes + required headers
telemetry.schema
diagnostics.schema
extraction.tasks.catalog
ops.runbooks
- e.g. “how to debug iOS keyboard insert_noop”
product.identity
- productId, plan tiers, allowed baseUrls

Prompts (MCP prompt templates)

prompt.support_triage
prompt.telemetry_policy_proposal
prompt.remote_diagnostics_session_plan
prompt.extraction_task_design

Authentication & authorization

Primary: platform-service JWT (same verifyToken logic).
Secondary: service-to-service API tokens (only for trusted automation).
Tool gating
- viewer: query-only
- admin: policy updates, create diagnostics sessions
- super_admin: secret rotation, maintenance, destructive operations
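
The gating rules above can be expressed as a minimum-role table checked before any tool runs. This is a sketch only: the three roles come from this document, but the `MIN_ROLE` table, the sample entries, and `canInvoke` are illustrative names, and a full table would cover every registered tool.

```typescript
type Role = "viewer" | "admin" | "super_admin";

// Higher rank means more privilege.
const ROLE_RANK: { [R in Role]: number } = { viewer: 0, admin: 1, super_admin: 2 };

// Illustrative subset of the tool surface with the minimum role for each tool.
const MIN_ROLE: { [tool: string]: Role } = {
  "telemetry.queryEvents": "viewer",
  "diagnostics.getLogs": "viewer",
  "telemetry.createPolicy": "admin",
  "diagnostics.createSession": "admin",
  "webhooks.rotateSecret": "super_admin",
  "maintenance.set": "super_admin",
};

// Returns true only if the tool is known and the caller's role is high enough.
// Tools missing from the table are denied rather than defaulting to viewer.
function canInvoke(role: Role, tool: string): boolean {
  if (!Object.prototype.hasOwnProperty.call(MIN_ROLE, tool)) {
    return false;
  }
  const required = MIN_ROLE[tool] as Role;
  return ROLE_RANK[role] >= ROLE_RANK[required];
}
```

Denying unlisted tools keeps a newly added mutating tool from being callable before someone assigns it a role.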

Observability for the MCP server

Use structured logs (Fastify/pino style) and propagate x-request-id.
Record tool-invocation metrics in telemetry on the backend_service channel:
- module: mcp
- eventName: tool_invoked, tool_failed, a2a_handoff
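
One way to emit these events is a small wrapper around each tool handler, sketched below. The `channel`, `module`, and `eventName` values come from the list above; the remaining fields (`requestId`, `tool`, `durationMs`), the `emit` callback, and `instrumentTool` itself are assumptions, and the handler is synchronous only for brevity.

```typescript
type McpEventName = "tool_invoked" | "tool_failed" | "a2a_handoff";

// Fields beyond channel/module/eventName are hypothetical.
interface McpTelemetryEvent {
  channel: "backend_service";
  module: "mcp";
  eventName: McpEventName;
  requestId: string; // value propagated from x-request-id
  tool: string;
  durationMs: number;
}

// Runs a tool handler and emits exactly one event describing the outcome.
function instrumentTool<T>(
  tool: string,
  requestId: string,
  emit: (event: McpTelemetryEvent) => void,
  run: () => T
): T {
  const startedAt = Date.now();
  const makeEvent = function (eventName: McpEventName): McpTelemetryEvent {
    return {
      channel: "backend_service",
      module: "mcp",
      eventName: eventName,
      requestId: requestId,
      tool: tool,
      durationMs: Date.now() - startedAt,
    };
  };

  let result: T;
  try {
    result = run();
  } catch (err) {
    emit(makeEvent("tool_failed"));
    throw err; // the caller still sees the original error
  }
  emit(makeEvent("tool_invoked"));
  return result;
}
```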

Safe defaults / guardrails

Any mutating tool should support a dryRun: true mode.
Enforce expiresAt on any “diagnostic collection amplification” (telemetry policy, diagnostics session).
Cap queries by default (limit/pageSize); require callers to opt in explicitly to larger limits.
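
The three guardrails could be implemented as small shared helpers, roughly as below. The numeric defaults (`DEFAULT_LIMIT`, `MAX_LIMIT`, `MAX_TTL_MS`) and all function names are placeholders chosen for this sketch, not values taken from platform-service.

```typescript
const DEFAULT_LIMIT = 50; // assumed default page size
const MAX_LIMIT = 500; // assumed hard ceiling, reachable only by opting in
const MAX_TTL_MS = 24 * 60 * 60 * 1000; // assumed longest allowed collection window

// Caps a requested limit; larger values require an explicit opt-in flag.
function clampLimit(requested?: number, allowIncrease: boolean = false): number {
  if (requested === undefined || !(requested >= 1)) {
    return DEFAULT_LIMIT;
  }
  const ceiling = allowIncrease ? MAX_LIMIT : DEFAULT_LIMIT;
  return Math.min(Math.floor(requested), ceiling);
}

// Rejects missing, invalid, past, or overly distant expiry timestamps.
function requireExpiresAt(expiresAt: string | undefined, nowMs: number): string {
  if (expiresAt === undefined) {
    throw new Error("expiresAt is required");
  }
  const at = Date.parse(expiresAt);
  if (isNaN(at)) {
    throw new Error("expiresAt is not a valid timestamp");
  }
  if (at <= nowMs) {
    throw new Error("expiresAt must be in the future");
  }
  if (at - nowMs > MAX_TTL_MS) {
    throw new Error("expiresAt exceeds the maximum collection window");
  }
  return new Date(at).toISOString();
}

interface MutationOutcome<T> {
  applied: boolean;
  plan: string;
  result: T | null;
}

// With dryRun set, reports the plan and never calls `execute`.
function runMutation<T>(dryRun: boolean, plan: string, execute: () => T): MutationOutcome<T> {
  if (dryRun) {
    return { applied: false, plan: plan, result: null };
  }
  return { applied: true, plan: plan, result: execute() };
}
```

Passing `nowMs` in rather than reading the clock inside `requireExpiresAt` keeps the check deterministic in tests.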

Known integration risk (fix early)

@bytelyst/diagnostics-client currently flushes to POST /api/diagnostics/ingest, while platform-service ingests via session-scoped endpoints.
Resolve this mismatch before using diagnostics tooling as a core MCP/A2A workflow dependency.
- Decision: update @bytelyst/diagnostics-client to flush to the session-scoped endpoints, POST /api/diagnostics/sessions/:id/logs|traces.
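
For clarity, the session-scoped paths named in the decision can be built as shown below. The helper name `sessionIngestPath` is hypothetical and is not taken from @bytelyst/diagnostics-client; only the path shape comes from this document.

```typescript
// The two ingest streams named in the decision above.
type DiagnosticsStream = "logs" | "traces";

// Builds the session-scoped ingest path for one stream of one session.
function sessionIngestPath(sessionId: string, stream: DiagnosticsStream): string {
  if (sessionId.length === 0) {
    throw new Error("sessionId is required for session-scoped ingest");
  }
  return "/api/diagnostics/sessions/" + encodeURIComponent(sessionId) + "/" + stream;
}
```

Encoding the session id guards against ids that contain characters with meaning in a URL path.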

Suggested initial tool surface (minimal viable)

telemetry.queryEvents, telemetry.listClusters, telemetry.listPolicies, telemetry.previewPolicy, telemetry.createPolicy
diagnostics.createSession, diagnostics.getSession, diagnostics.getLogs, diagnostics.getTraces
extraction.extract, extraction.extractBatch, extraction.sidecarHealth
jobs.list, jobs.trigger