A2A Orchestration Framework — Recommended Pattern (ByteLyst)

Intent

Standardize how multiple agents collaborate on platform + product tasks (support, ops, releases, prompt iterations) with:

  • explicit roles
  • explicit handoff artifacts
  • consistent safety + audit

Canonical roles (agents)

1) Router / Dispatcher

  • Responsibility
    • decide which specialist agent(s) to involve
    • enforce policy (role gating, PII constraints)
  • Inputs
    • user request + current environment (productId, repo)
  • Outputs
    • sequence of tasks (plan) + handoff payloads

2) Telemetry Analyst

  • Responsibility
    • find clusters, correlate by version/platform, propose next steps
  • Calls
    • telemetry MCP tools

3) Diagnostics Orchestrator

  • Responsibility
    • start / monitor remote diagnostics sessions
    • summarize results
  • Calls
    • diagnostics MCP tools

4) Extraction Task Designer

  • Responsibility
    • design extraction task prompts/examples
    • coordinate evaluation runs
  • Calls
    • extraction MCP tools

5) Ops Executor

  • Responsibility
    • execute mutating ops (job triggers, maintenance windows, flag changes)
  • Calls
    • platform MCP tools

6) Fix / PR Agent

  • Responsibility
    • implement code changes
    • ensure tests pass and repo conventions are followed

7) Report Writer

  • Responsibility
    • produce a final summary in a consistent format
    • include links/IDs (clusterId, sessionId, policyId)

Handoff artifacts (contracts)

Examples below are illustrative. All artifacts must be explicitly scoped to a productId.
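
The scoping rule above can be checked mechanically before any handoff. The following is a minimal sketch, assuming artifacts travel as plain dictionaries; `require_product_scope` is a hypothetical helper name, not an existing platform API.

```python
from typing import Any, Mapping


def require_product_scope(artifact: Mapping[str, Any]) -> str:
    """Return the artifact's productId, or raise if it is missing or blank."""
    product_id = artifact.get("productId")
    if not isinstance(product_id, str) or not product_id.strip():
        raise ValueError("handoff artifact is not scoped to a productId")
    return product_id
```

A dispatcher could call this on every outgoing payload so that an unscoped artifact fails fast instead of reaching a specialist agent.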

A) Support incident brief

{
  "productId": "<productId>",
  "userReport": {
    "summary": "dictation inserts nothing in Messages",
    "platform": "ios",
    "channel": "keyboard_extension",
    "appVersion": "1.2.0",
    "buildNumber": "35",
    "userId": "usr_...",
    "anonymousInstallId": "..."
  },
  "timeWindow": { "from": "...", "to": "..." }
}

Mapping note: userReport.anonymousInstallId maps to diagnostics session targeting via targetAnonymousId.
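
The mapping note can be sketched as a small function. It assumes the brief arrives as a dictionary shaped like example A. The function name and the returned dictionary shape are hypothetical; only `anonymousInstallId` and `targetAnonymousId` come from the note above.

```python
from typing import Any, Dict, Mapping


def diagnostics_target_from_brief(brief: Mapping[str, Any]) -> Dict[str, str]:
    """Derive diagnostics targeting fields from a support incident brief."""
    report = brief.get("userReport", {})
    install_id = report.get("anonymousInstallId")
    if not install_id:
        raise ValueError("userReport.anonymousInstallId is required for targeting")
    # Carry productId along so the derived artifact stays product-scoped.
    return {
        "productId": brief["productId"],
        "targetAnonymousId": install_id,
    }
```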

B) Telemetry findings

{
  "productId": "<productId>",
  "clusters": [{ "clusterId": "...", "pk": "...", "severity": "error" }],
  "topHypotheses": ["permission denied", "insertText noop"],
  "recommendedActions": ["start diagnostics session", "enable debug policy for one user"]
}

C) Diagnostics session plan

{
  "productId": "<productId>",
  "target": { "userId": "...", "deviceId": "..." },
  "collection": { "level": "trace", "captureNetwork": true, "captureLogs": true },
  "expiresInMinutes": 30
}
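
Because the plan carries `expiresInMinutes`, an orchestrator can refuse any session that is not time-bounded. The following is a minimal sketch, assuming the plan is a dictionary shaped like example C. The 60-minute cap and the function name are hypothetical and not taken from any platform policy.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Mapping, Optional

MAX_SESSION_MINUTES = 60  # hypothetical cap; set this from actual policy


def session_expiry(plan: Mapping[str, Any], now: Optional[datetime] = None) -> datetime:
    """Return the absolute expiry time for a plan, rejecting unbounded sessions."""
    minutes = plan.get("expiresInMinutes")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(minutes, bool) or not isinstance(minutes, int):
        raise ValueError("expiresInMinutes must be an integer")
    if not 0 < minutes <= MAX_SESSION_MINUTES:
        raise ValueError(f"expiresInMinutes must be between 1 and {MAX_SESSION_MINUTES}")
    start = now if now is not None else datetime.now(timezone.utc)
    return start + timedelta(minutes=minutes)
```

Passing `now` explicitly keeps the check deterministic in tests; in normal use it defaults to the current UTC time.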

D) Patch plan (code)

  • scope, files, risk, tests

Routing logic (simple)

  • If the request mentions:
    • "crash" / "not working" / "bug" → Telemetry Analyst → Diagnostics Orchestrator → Fix / PR Agent
    • "extraction" / "entity" / "triage" → Extraction Task Designer → Eval Runner → Fix / PR Agent
    • "maintenance" / "flag" / "job" → Ops Executor
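
The keyword rules above can be sketched as a lookup table plus a first-match scan. This is an illustration only: the `ROUTES` table and `route` function are hypothetical names, and returning an empty list when nothing matches is an assumption, since the list does not define a fallback.

```python
import re
from typing import List, Sequence, Tuple

# Keyword groups and agent sequences copied from the routing list above.
ROUTES: Sequence[Tuple[Tuple[str, ...], Tuple[str, ...]]] = (
    (("crash", "not working", "bug"),
     ("Telemetry Analyst", "Diagnostics Orchestrator", "Fix / PR Agent")),
    (("extraction", "entity", "triage"),
     ("Extraction Task Designer", "Eval Runner", "Fix / PR Agent")),
    (("maintenance", "flag", "job"),
     ("Ops Executor",)),
)


def route(request: str) -> List[str]:
    """Return the agent sequence for the first keyword group that matches."""
    text = request.lower()
    for keywords, agents in ROUTES:
        for keyword in keywords:
            # Anchor at a word start so "bug" does not fire on "debug",
            # while "crashes" and "jobs" still match.
            if re.search(rf"\b{re.escape(keyword)}", text):
                return list(agents)
    return []  # assumption: no match leaves the decision to the dispatcher
```

First-match ordering means a request that mentions both "crash" and "flag" goes down the telemetry path; a real dispatcher may prefer to merge plans or ask for clarification.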

Safety rules

  • Never include raw user content in telemetry or diagnostics payloads.
  • Diagnostics sessions must be time-bounded.
  • Mutating actions require:
    • explicit approval from the Router / Dispatcher
    • audit log
    • optional dry-run
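
The three requirements for mutating actions can be combined into one gate in front of the Ops Executor. The following is a minimal sketch, assuming an in-memory audit list; `MutatingAction`, `OpsGate`, and their fields are hypothetical names, and a real deployment would write audit entries to durable storage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MutatingAction:
    name: str
    product_id: str
    approved_by_dispatcher: bool = False


@dataclass
class OpsGate:
    audit_log: List[Dict[str, str]] = field(default_factory=list)

    def _audit(self, action: MutatingAction, outcome: str) -> None:
        # Every attempt is recorded, including rejections and dry runs.
        self.audit_log.append(
            {"action": action.name, "productId": action.product_id, "outcome": outcome}
        )

    def execute(self, action: MutatingAction, run: Callable[[], str],
                dry_run: bool = False) -> str:
        if not action.approved_by_dispatcher:
            self._audit(action, "rejected")
            raise PermissionError(f"{action.name} lacks dispatcher approval")
        if dry_run:
            # Skip the side effect entirely but keep the audit trail.
            self._audit(action, "dry_run")
            return "dry-run: no changes applied"
        try:
            result = run()
        except Exception:
            self._audit(action, "failed")
            raise
        self._audit(action, "executed")
        return result
```

Keeping the approval flag on the action, rather than inside the executor, means the dispatcher's decision travels with the handoff payload and shows up in the audit trail.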

Where A2A yields immediate wins in this workspace

  • Telemetry policy governance
    • Planner + Reviewer pattern
  • Remote diagnostics
    • Orchestrator agent that monitors sessions and summarizes
  • Prompt iteration loops (extraction)
    • Task designer + eval runner separation
  • Release workflows
    • Dedicated agent for quality gates (build/test/typecheck) and a separate agent for publishing