docs(roadmaps): add agent platform gap roadmap set

2026-03-14 14:34:08 +00:00 · 2026-03-14 14:34:08 +00:00 · d4c725a29d
commit d4c725a29d
parent 8ad3e1be34
11 changed files with 1129 additions and 0 deletions
--- a/docs/roadmaps/not-started/platform_AGENT_PLATFORM_GAP_ROADMAP_INDEX.md
+++ b/docs/roadmaps/not-started/platform_AGENT_PLATFORM_GAP_ROADMAP_INDEX.md
@ -0,0 +1,109 @@
+# Agent Platform Gaps - Roadmap Index
+
+> **Purpose:** Turn the current agent-company platform gaps into an actionable roadmap set.
+>
+> **Scope:** `learning_ai_common_plat`
+>
+> **Date:** 2026-03-14
+>
+> **Status:** Planned
+
+---
+
+## Executive Summary
+
+The shared platform already covers a large amount of generic SaaS infrastructure:
+auth, telemetry, diagnostics, flags, delivery, jobs, marketplace, billing-related
+modules, extraction, MCP tooling, and durable queue primitives.
+
+What is still missing is the **agent control plane**:
+
+1. durable agent run orchestration
+2. org/workspace/team/RBAC
+3. agent registry and prompt versioning
+4. reusable knowledge/RAG
+5. human review and approval queue
+6. support case management
+7. durable cross-service eventing and worker runtime
+8. centralized AI governance and evals
+9. AI budget and cost governance
+10. enterprise provisioning and SCIM
+
+This roadmap set breaks those gaps into separate implementation documents so they
+can be sequenced without mixing concerns.
+
+---
+
+## Roadmap Set
+
+1. [Agent Runtime & Orchestration Roadmap](./platform_AGENT_RUNTIME_ORCHESTRATION_ROADMAP.md)
+2. [Org, Workspace & RBAC Roadmap](./platform_ORG_WORKSPACE_RBAC_ROADMAP.md)
+3. [Agent Registry & Prompt Versioning Roadmap](./platform_AGENT_REGISTRY_PROMPT_VERSIONING_ROADMAP.md)
+4. [Knowledge & RAG Service Roadmap](./platform_KNOWLEDGE_RAG_SERVICE_ROADMAP.md)
+5. [Human Review & Approval Queue Roadmap](./platform_HUMAN_REVIEW_APPROVAL_QUEUE_ROADMAP.md)
+6. [Support Case Management Roadmap](./platform_SUPPORT_CASE_MANAGEMENT_ROADMAP.md)
+7. [Durable Event Bus & Worker Runtime Roadmap](./platform_DURABLE_EVENT_BUS_AND_WORKER_RUNTIME_ROADMAP.md)
+8. [AI Governance & Evaluation Roadmap](./platform_AI_GOVERNANCE_EVALS_ROADMAP.md)
+9. [AI Budget & Cost Governance Roadmap](./platform_AI_BUDGET_COST_GOVERNANCE_ROADMAP.md)
+10. [Enterprise Provisioning & SCIM Roadmap](./platform_ENTERPRISE_PROVISIONING_SCIM_ROADMAP.md)
+
+---
+
+## Existing Repo Signals
+
+These gaps are not theoretical. The current codebase already shows the partial
+foundations and the missing layers:
+
+- Durable queue primitives now exist in `packages/queue/`, but agent orchestration in
+  `services/mcp-server/src/modules/a2a/runner.ts` still tracks progress primarily through
+  logs rather than persisted run state.
+- `platform-service` has broad product infrastructure, but there is no first-class
+  org/workspace/team module under `services/platform-service/src/modules/`.
+- Enterprise IdP support exists in `services/platform-service/src/modules/auth/enterprise/`,
+  but enterprise provisioning does not.
+- The event bus in `packages/events/src/memory.ts` is in-process only.
+- `ai-diagnostics` already uses embeddings and vector similarity, but there is no reusable
+  knowledge service for general agent retrieval.
+- MFA push approvals exist, but there is no general review queue for agent actions.
+
+---
+
+## Recommended Build Order
+
+### P0
+
+1. Agent Runtime & Orchestration
+2. Durable Event Bus & Worker Runtime
+3. Org, Workspace & RBAC
+4. Human Review & Approval Queue
+
+### P1
+
+5. Agent Registry & Prompt Versioning
+6. Knowledge & RAG Service
+7. AI Budget & Cost Governance
+8. AI Governance & Evaluation
+
+### P2
+
+9. Enterprise Provisioning & SCIM
+10. Support Case Management
+
+---
+
+## Architectural Guidance
+
+These docs assume the current repo direction remains:
+
+- TypeScript + Fastify services
+- shared `@bytelyst/*` packages
+- `platform-service` as control plane
+- `mcp-server` as operator and A2A interface
+
+However, some missing capabilities are more naturally relational or workflow-heavy
+than the current Cosmos-first platform modules. Each roadmap therefore includes:
+
+- a **recommended stack** for long-term quality
+- a **repo-fit alternative** that stays closer to current conventions
+
+That is intentional. The best industry-standard choice is not always the same as
+the least disruptive repo-local choice.
--- a/docs/roadmaps/not-started/platform_AGENT_REGISTRY_PROMPT_VERSIONING_ROADMAP.md
+++ b/docs/roadmaps/not-started/platform_AGENT_REGISTRY_PROMPT_VERSIONING_ROADMAP.md
@ -0,0 +1,94 @@
+# Agent Registry & Prompt Versioning Roadmap
+
+> **Purpose:** Create a system of record for agents, prompts, tools, versions,
+> rollout states, and release governance.
+>
+> **Primary Surfaces:** `services/platform-service/`, `services/mcp-server/`
+>
+> **Status:** Planned
+>
+> **Estimated Effort:** 2-3 weeks
+
+---
+
+## Why This Is Missing
+
+The repo has MCP tools and A2A pipelines, but it does not have a persistent registry
+for the definitions that power them. Without that, agent behavior is embedded in code
+and docs rather than treated as versioned platform data.
+
+---
+
+## Recommended Stack
+
+- **PostgreSQL** for metadata and version history
+- **Blob storage or Git-backed artifacts** for prompt files and larger assets
+- **OpenTelemetry** for linking versions to production runs
+
+### Repo-Fit Alternative
+
+- Cosmos-backed registry module in `platform-service`
+- Prompt artifacts stored in blob storage
+- MCP server resolves active versions from `platform-service`
+
+---
+
+## Phase 1 - Core Registry
+
+- [ ] Create modules:
+  - [ ] `agent-registry`
+  - [ ] `prompt-registry`
+  - [ ] `tool-bundles`
+- [ ] Add entities:
+  - [ ] `AgentDefinition`
+  - [ ] `AgentVersion`
+  - [ ] `PromptTemplate`
+  - [ ] `PromptVersion`
+  - [ ] `ToolBundle`
+  - [ ] `ReleaseChannel`
+- [ ] Track:
+  - [ ] owner
+  - [ ] changelog
+  - [ ] status: `draft`, `staged`, `active`, `deprecated`, `archived`
+  - [ ] compatibility constraints
+
+**Acceptance Criteria**
+
+- Every production agent has a durable version record
+- Prompt changes are diffable and auditable
+
+---
+
+## Phase 2 - Rollouts & Safety
+
+- [ ] Add staged rollouts by product, org, workspace, or cohort
+- [ ] Add allowlists and freeze controls
+- [ ] Add prompt approval requirement for sensitive agents
+- [ ] Add rollback support
+- [ ] Link agent versions to eval results and incidents
+
+---
+
+## Phase 3 - Runtime Integration
+
+- [ ] `mcp-server` loads active definitions from registry rather than code-only defaults
+- [ ] `agent-runs` store `agentVersion` and `promptVersion`
+- [ ] Support replay against older versions for regression analysis
+
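As a rough illustration of the runtime integration above, the sketch below resolves the highest `active` version of a prompt and pins it to a run record. The status values come from Phase 1; the `PromptVersion` fields, `resolveActiveVersion`, and `pinRun` are hypothetical names chosen for this example, not existing repo APIs.

```typescript
// Status values are taken from the roadmap; everything else is an assumption.
type VersionStatus = "draft" | "staged" | "active" | "deprecated" | "archived";

interface PromptVersion {
  promptId: string;
  version: number;
  status: VersionStatus;
  body: string;
}

interface PinnedRun {
  runId: string;
  promptId: string;
  promptVersion: number;
}

// Returns the highest-numbered active version, or null when none is active.
function resolveActiveVersion(versions: PromptVersion[], promptId: string): PromptVersion | null {
  let best: PromptVersion | null = null;
  for (const candidate of versions) {
    if (candidate.promptId !== promptId || candidate.status !== "active") continue;
    if (best === null || candidate.version > best.version) best = candidate;
  }
  return best;
}

// Pinning the version on the run record is what makes later replay possible.
function pinRun(runId: string, version: PromptVersion): PinnedRun {
  return { runId: runId, promptId: version.promptId, promptVersion: version.version };
}
```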
+---
+
+## Tech Stack Options
+
+| Option                           | Pros                       | Cons                           | Fit                         |
+| -------------------------------- | -------------------------- | ------------------------------ | --------------------------- |
+| PostgreSQL + blob storage        | Strong relational history  | New datastore                  | Best long-term              |
+| Git as source of truth + sync DB | Great developer ergonomics | Dual-source complexity         | Good for prompt-heavy teams |
+| Cosmos + blob storage            | Consistent with repo       | Version queries less ergonomic | Good short-term             |
+
+---
+
+## Risks
+
+- Code-only prompt management creates invisible production drift
+- Without version pinning, incident replay and audit are weak
+- Registry without rollout controls is just a metadata catalog
--- a/docs/roadmaps/not-started/platform_AGENT_RUNTIME_ORCHESTRATION_ROADMAP.md
+++ b/docs/roadmaps/not-started/platform_AGENT_RUNTIME_ORCHESTRATION_ROADMAP.md
@ -0,0 +1,146 @@
+# Agent Runtime & Orchestration Roadmap
+
+> **Purpose:** Build a durable execution layer for agent runs, step transitions,
+> cancellations, retries, resumability, and operator-visible history.
+>
+> **Primary Surfaces:** `services/platform-service/`, `services/mcp-server/`,
+> `packages/queue/`
+>
+> **Status:** Planned
+>
+> **Estimated Effort:** 3-5 weeks
+
+---
+
+## Why This Is Missing
+
+The repo now has durable queue primitives, but agent execution is still not a
+first-class platform service. A2A pipelines in `services/mcp-server/src/modules/a2a/`
+are composed code paths rather than durable runs with persistent step state.
+
+That is enough for prototypes. It is not enough for:
+
+- long-running multi-step agents
+- retries after process restarts
+- human escalation in the middle of a run
+- cancellation and pause/resume
+- auditable run history
+
+---
+
+## Recommended Stack
+
+### Best Long-Term Industry Standard
+
+- **Temporal** for workflow orchestration
+- **PostgreSQL** for run metadata and operator queries
+- **Redis** for short-lived coordination and cache
+
+### Best Repo-Fit Option
+
+- `@bytelyst/queue` for durable job dispatch
+- `platform-service` run records in Cosmos or behind a datastore abstraction
+- `mcp-server` as orchestration client and tool executor
+
+### Recommendation
+
+Start with the repo-fit option to get durable runs quickly, but design the run model
+so a later move to Temporal is possible without rewriting every agent contract.
+
+---
+
+## Phase 1 - Canonical Run Model
+
+- [ ] Create `services/platform-service/src/modules/agent-runs/`
+- [ ] Define `AgentRunDoc`, `AgentRunStepDoc`, `AgentRunEventDoc`
+- [ ] Support states: `queued`, `running`, `waiting_for_input`, `paused`, `succeeded`, `failed`, `cancelled`
+- [ ] Add `parentRunId`, `workflowId`, `agentId`, `agentVersion`, `triggerSource`
+- [ ] Persist step inputs, outputs, timings, error summaries, and correlation IDs
+- [ ] Add APIs:
+  - [ ] `POST /agent-runs`
+  - [ ] `GET /agent-runs/:id`
+  - [ ] `GET /agent-runs/:id/events`
+  - [ ] `POST /agent-runs/:id/cancel`
+  - [ ] `POST /agent-runs/:id/pause`
+  - [ ] `POST /agent-runs/:id/resume`
+
+**Acceptance Criteria**
+
+- Every agent run has durable metadata and step history
+- A run can be fetched after service restart
+- Cancellation and pause are explicit states, not implicit errors
+
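One way to make cancellation and pause explicit states is a small transition table that rejects illegal moves. The state names below come from the checklist above; the specific allowed transitions, the `AgentRun` shape, and the function names are assumptions for illustration, not a settled design.

```typescript
// States are from the roadmap; the transition table is an illustrative assumption.
type AgentRunState =
  | "queued" | "running" | "waiting_for_input" | "paused"
  | "succeeded" | "failed" | "cancelled";

const ALLOWED_TRANSITIONS: Record<AgentRunState, AgentRunState[]> = {
  queued: ["running", "cancelled"],
  running: ["waiting_for_input", "paused", "succeeded", "failed", "cancelled"],
  waiting_for_input: ["running", "cancelled"],
  paused: ["running", "cancelled"],
  succeeded: [], // terminal
  failed: [],    // terminal
  cancelled: [], // terminal
};

interface AgentRun {
  id: string;
  state: AgentRunState;
}

function canTransition(from: AgentRunState, to: AgentRunState): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

// Returns a new record instead of mutating, so the caller can persist it in one write.
function transitionRun(run: AgentRun, to: AgentRunState): AgentRun {
  if (!canTransition(run.state, to)) {
    throw new Error("Illegal transition " + run.state + " -> " + to);
  }
  return { id: run.id, state: to };
}
```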
+---
+
+## Phase 2 - Queue-Backed Execution
+
+- [ ] Add `agent.run.execute` queue type on top of `@bytelyst/queue`
+- [ ] Add per-step retries with backoff
+- [ ] Add lease heartbeat for long-running steps
+- [ ] Add idempotency keys for replays
+- [ ] Add dead-letter handling and operator inspection
+- [ ] Record structured run events for step started/completed/failed/retried
+
+**Acceptance Criteria**
+
+- In-flight runs survive worker restart
+- Retried steps do not duplicate side effects when idempotency is configured
+- Dead-lettered runs are queryable and replayable
+
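The retry and idempotency items above could look roughly like the following. This is a minimal sketch: the in-memory `completedSteps` map stands in for a durable store, and none of the names are confirmed `@bytelyst/queue` APIs. The key deliberately excludes the attempt number so that every retry of the same step shares one key.

```typescript
// Exponential backoff without jitter; attempt is 1-based and the delay is capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt - 1));
}

interface StepResult {
  output: string;
  replayed: boolean; // true when the stored result was returned instead of re-running
}

// Hypothetical in-memory stand-in for a durable idempotency store.
const completedSteps: { [key: string]: string } = {};

function idempotencyKey(runId: string, stepId: string): string {
  return runId + ":" + stepId;
}

// Runs the side effect at most once per (runId, stepId); retries get the stored output.
function executeStepOnce(runId: string, stepId: string, sideEffect: () => string): StepResult {
  const key = idempotencyKey(runId, stepId);
  if (Object.prototype.hasOwnProperty.call(completedSteps, key)) {
    return { output: completedSteps[key], replayed: true };
  }
  const output = sideEffect();
  completedSteps[key] = output;
  return { output: output, replayed: false };
}
```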
+---
+
+## Phase 3 - A2A Integration
+
+- [ ] Replace direct A2A pipeline progression in `mcp-server` with run orchestration APIs
+- [ ] Make every pipeline step emit durable run events
+- [ ] Support handoff to human review queue
+- [ ] Support child runs for delegated agent tasks
+- [ ] Add run-level audit links to diagnostics, telemetry, and support systems
+
+**Acceptance Criteria**
+
+- `mcp-server` no longer owns the durable run state itself
+- A2A pipelines are observable step-by-step
+- Human review can pause and later resume a run
+
+---
+
+## Phase 4 - Operator Experience
+
+- [ ] Admin UI for runs, filters, and replay
+- [ ] Timeline view per run
+- [ ] Step diff view for prompt/tool transitions
+- [ ] Cancel/retry/replay controls
+- [ ] SLOs: success rate, mean run duration, retry rate, dead-letter count
+
+---
+
+## Tech Stack Options
+
+| Option                       | Pros                                              | Cons                              | Fit                     |
+| ---------------------------- | ------------------------------------------------- | --------------------------------- | ----------------------- |
+| Temporal                     | Best workflow semantics, retries, signals, timers | New infra, steeper learning curve | Best long-term          |
+| BullMQ + Redis + run DB      | Simple, common in Node                            | Workflow semantics are custom     | Strong practical option |
+| `@bytelyst/queue` + run docs | Lowest disruption to repo                         | More framework logic to build     | Best immediate path     |
+
+---
+
+## Risks
+
+- Custom orchestration can become a weak in-house Temporal clone if not scoped tightly
+- If step contracts are not versioned, replay becomes unsafe
+- If all state remains in logs, operator tooling will never be reliable
+
+---
+
+## Recommendation
+
+Implement the v1 run system inside `platform-service` using `@bytelyst/queue`, but
+borrow Temporal-style concepts from day one:
+
+- workflow ID
+- run ID
+- signals
+- child runs
+- durable timers
+- explicit waiting states
--- a/docs/roadmaps/not-started/platform_AI_BUDGET_COST_GOVERNANCE_ROADMAP.md
+++ b/docs/roadmaps/not-started/platform_AI_BUDGET_COST_GOVERNANCE_ROADMAP.md
@ -0,0 +1,94 @@
+# AI Budget & Cost Governance Roadmap
+
+> **Purpose:** Add per-tenant and per-agent controls for model spend, quotas,
+> budgets, alerts, and invoiceable AI usage.
+>
+> **Primary Surface:** `services/platform-service/`
+>
+> **Status:** Planned
+>
+> **Estimated Effort:** 2-3 weeks
+
+---
+
+## Why This Is Missing
+
+Model usage is often more volatile than standard API usage. Agent companies need
+controls for:
+
+- daily and monthly spend caps
+- per-workspace or per-agent budgets
+- model allowlists and deny rules
+- burst protection
+- usage attribution for billing
+
+The repo has usage and billing modules, but not a dedicated AI cost governance layer.
+
+---
+
+## Recommended Stack
+
+- `platform-service` cost governance module
+- usage ledger in Cosmos or PostgreSQL
+- provider-specific pricing tables
+- alerting through Slack, Telegram, and email
+
+### Recommendation
+
+This fits naturally in `platform-service`. The key is not the datastore; it is the
+quality of attribution and enforcement.
+
+---
+
+## Phase 1 - Usage Ledger
+
+- [ ] Create modules:
+  - [ ] `ai-usage`
+  - [ ] `ai-budgets`
+  - [ ] `ai-pricing`
+- [ ] Store:
+  - [ ] tenant
+  - [ ] workspace
+  - [ ] agent
+  - [ ] provider
+  - [ ] model
+  - [ ] tokens or units
+  - [ ] cost estimate
+  - [ ] request correlation ID
+
+---
+
+## Phase 2 - Enforcement
+
+- [ ] Preflight budget checks before expensive calls
+- [ ] Rate and spend throttles
+- [ ] Model allowlists by tenant
+- [ ] Degrade-to-cheaper-model policy
+- [ ] Hard cap vs soft cap behavior
+
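A preflight check with soft and hard caps might be sketched as below. The mapping of the soft cap to the degrade-to-cheaper-model policy and the hard cap to blocking is an assumption made for this example, as are all type and function names; real pricing would come from the provider-specific pricing tables.

```typescript
type BudgetDecision = "allow" | "degrade" | "block";

interface BudgetPolicy {
  softCapUsd: number; // assumed: crossing this triggers the cheaper-model fallback
  hardCapUsd: number; // assumed: crossing this rejects the call outright
}

// Cost estimate from a per-1k-token price, as a pricing table row might provide.
function estimateCostUsd(tokens: number, pricePer1kTokensUsd: number): number {
  return (tokens / 1000) * pricePer1kTokensUsd;
}

// Decides before the model call, using spend so far plus the estimate for this call.
function preflightBudgetCheck(
  spentUsd: number,
  estimatedUsd: number,
  policy: BudgetPolicy
): BudgetDecision {
  const projected = spentUsd + estimatedUsd;
  if (projected > policy.hardCapUsd) return "block";
  if (projected > policy.softCapUsd) return "degrade";
  return "allow";
}
```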
+---
+
+## Phase 3 - Visibility
+
+- [ ] Admin reports by tenant, agent, provider, and model
+- [ ] Budget burn-down alerts
+- [ ] Anomaly detection for spend spikes
+- [ ] Export for finance and customer invoicing
+
+---
+
+## Tech Stack Options
+
+| Option                                  | Pros                              | Cons                          | Fit                       |
+| --------------------------------------- | --------------------------------- | ----------------------------- | ------------------------- |
+| Platform-native ledger + pricing tables | Full control and tenant awareness | Requires pricing upkeep       | Best fit                  |
+| External spend tool only                | Fast bootstrap                    | Weak product attribution      | Limited                   |
+| Billing-module extension only           | Less module sprawl                | AI-specific logic gets buried | Acceptable but less clear |
+
+---
+
+## Risks
+
+- Without spend controls, one bad prompt or loop can create material cost
+- Without tenant attribution, enterprise billing becomes unreliable
+- Without enforcement, dashboards become retrospective only
--- a/docs/roadmaps/not-started/platform_AI_GOVERNANCE_EVALS_ROADMAP.md
+++ b/docs/roadmaps/not-started/platform_AI_GOVERNANCE_EVALS_ROADMAP.md
@ -0,0 +1,92 @@
+# AI Governance & Evaluation Roadmap
+
+> **Purpose:** Centralize evals, policy enforcement, safety review, release gates,
+> and regression tracking for prompts, agents, and model behavior.
+>
+> **Primary Surfaces:** `services/platform-service/`, `services/extraction-service/`,
+> `services/mcp-server/`
+>
+> **Status:** Planned
+>
+> **Estimated Effort:** 3-5 weeks
+
+---
+
+## Why This Is Missing
+
+The repo has useful pieces:
+
+- extraction evals
+- telemetry and diagnostics
+- flags and experiments
+
+What it does not have is a central AI governance surface that answers:
+
+- which prompts are approved
+- which eval suite a release passed
+- what policies apply to a class of agents
+- what changed after a model or prompt update
+
+---
+
+## Recommended Stack
+
+- `platform-service` governance modules
+- OpenTelemetry for trace-linked evidence
+- Promptfoo or a similar eval harness for offline regression
+- policy layer using code-first rules first, with optional Cedar or OPA later
+
+---
+
+## Phase 1 - Eval Registry
+
+- [ ] Create modules:
+  - [ ] `ai-evals`
+  - [ ] `ai-policies`
+  - [ ] `ai-releases`
+- [ ] Add entities:
+  - [ ] benchmark set
+  - [ ] eval run
+  - [ ] eval result
+  - [ ] policy decision
+  - [ ] release gate
+
+---
+
+## Phase 2 - Policy Engine
+
+- [ ] Add policy checks for:
+  - [ ] allowed models
+  - [ ] max temperature
+  - [ ] blocked tools
+  - [ ] required human review
+  - [ ] tenant-specific restrictions
+- [ ] Add release gates based on eval thresholds
+- [ ] Add regression detection on prompt or model changes
+
+---
+
+## Phase 3 - Operational Governance
+
+- [ ] Link agent and prompt versions to eval runs
+- [ ] Add incident-driven rollback recommendations
+- [ ] Add policy override audit logs
+- [ ] Add dashboards for pass rate, drift, and blocked releases
+
+---
+
+## Tech Stack Options
+
+| Option                        | Pros                       | Cons                     | Fit                 |
+| ----------------------------- | -------------------------- | ------------------------ | ------------------- |
+| Promptfoo + platform registry | Good current ecosystem fit | Need custom service glue | Best near-term      |
+| Custom eval runner only       | Full control               | Reinvents too much       | Weak starting point |
+| OPA/Cedar-backed governance   | Strong policy model        | More complexity          | Good phase 2+       |
+
+---
+
+## Risks
+
+- Shipping prompts without eval gating causes avoidable regressions
+- Governance only in docs will drift from runtime
+- No policy audit trail creates enterprise trust problems
--- a/docs/roadmaps/not-started/platform_DURABLE_EVENT_BUS_AND_WORKER_RUNTIME_ROADMAP.md
+++ b/docs/roadmaps/not-started/platform_DURABLE_EVENT_BUS_AND_WORKER_RUNTIME_ROADMAP.md
@ -0,0 +1,94 @@
+# Durable Event Bus & Worker Runtime Roadmap
+
+> **Purpose:** Replace in-process eventing and scattered background execution
+> with a durable cross-service event and worker backbone.
+>
+> **Primary Surfaces:** `packages/events/`, `packages/queue/`, `services/platform-service/`
+>
+> **Status:** Planned
+>
+> **Estimated Effort:** 3-4 weeks
+
+---
+
+## Why This Is Missing
+
+`packages/events/src/memory.ts` is an in-process event bus. That is useful for local
+dispatch inside one process, but it is not enough for:
+
+- cross-service subscriptions
+- replay
+- dead-letter handling
+- durable delivery
+- delayed fan-out
+
+The new `@bytelyst/queue` package improves durable background work, but the eventing
+layer is still incomplete.
+
+---
+
+## Recommended Stack
+
+### Best Long-Term Industry Standard
+
+- **Redis Streams** or **NATS JetStream** for durable event delivery
+- `@bytelyst/queue` or BullMQ for work execution
+- OpenTelemetry for trace correlation
+
+### Repo-Fit Option
+
+- Add a durable adapter to `@bytelyst/events`
+- Use Redis-backed delivery first
+- Keep current memory bus as test/dev adapter
+
+### Recommendation
+
+Use `@bytelyst/events` as the interface, but add a durable Redis or NATS adapter.
+Do not let direct in-memory emitters remain the production default for critical flows.
+
+---
+
+## Phase 1 - Event Abstraction
+
+- [ ] Extend `@bytelyst/events` to support pluggable backends
+- [ ] Keep `memory` for tests
+- [ ] Add `redis-streams` or `jetstream` adapter
+- [ ] Add consumer groups, ack, retry, and dead-letter support
+- [ ] Add correlation and causation IDs
+
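The pluggable-backend idea can be sketched as an interface with the memory bus as one implementation. The interface and class names here are hypothetical and do not reflect the current `@bytelyst/events` API; a Redis Streams or JetStream adapter would implement the same interface and add acks, retries, and dead-lettering behind it.

```typescript
interface PlatformEvent {
  id: string;
  type: string;
  correlationId: string;      // shared by every event in one logical flow
  causationId: string | null; // id of the event that directly caused this one
  payload: unknown;
}

type EventHandler = (event: PlatformEvent) => void;

// Hypothetical backend contract; durable adapters would implement the same shape.
interface EventBusBackend {
  publish(event: PlatformEvent): void;
  subscribe(type: string, handler: EventHandler): void;
}

// In-process adapter, suitable for tests and local development only.
class MemoryBackend implements EventBusBackend {
  private handlers: { [type: string]: EventHandler[] } = {};

  subscribe(type: string, handler: EventHandler): void {
    if (!this.handlers[type]) this.handlers[type] = [];
    this.handlers[type].push(handler);
  }

  publish(event: PlatformEvent): void {
    const registered = this.handlers[event.type] || [];
    for (const handler of registered) handler(event);
  }
}

// Builds a follow-up event that inherits the correlation ID and records its cause.
function causedBy(cause: PlatformEvent, id: string, type: string, payload: unknown): PlatformEvent {
  return { id: id, type: type, correlationId: cause.correlationId, causationId: cause.id, payload: payload };
}
```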
+---
+
+## Phase 2 - Worker Runtime
+
+- [ ] Standardize worker bootstrap pattern
+- [ ] Add handler registration, concurrency controls, leases, and health endpoints
+- [ ] Add poison-message and dead-letter inspection
+- [ ] Add scheduling and delayed dispatch
+
+---
+
+## Phase 3 - Service Migration
+
+- [ ] Move delivery subscribers onto durable events
+- [ ] Move auth side effects off fire-and-forget local emitters
+- [ ] Move MCP/A2A transitions onto durable events where appropriate
+- [ ] Add observability for event lag and failure rate
+
+---
+
+## Tech Stack Options
+
+| Option          | Pros                                            | Cons                           | Fit                 |
+| --------------- | ----------------------------------------------- | ------------------------------ | ------------------- |
+| NATS JetStream  | Strong event semantics, lightweight             | New infra and integration work | Excellent long-term |
+| Redis Streams   | Familiar, easy to adopt with BullMQ-style stack | Less specialized than NATS     | Best pragmatic path |
+| Kafka           | Powerful at scale                               | Heavy operational footprint    | Overkill now        |
+| Memory bus only | Simple                                          | Not durable                    | Dev/test only       |
+
+---
+
+## Risks
+
+- In-process events hide failures and block cross-service reliability
+- Durable queues without durable events still leave side effects fragile
+- Multiple custom worker patterns will drift without a standard runtime
--- a/docs/roadmaps/not-started/platform_ENTERPRISE_PROVISIONING_SCIM_ROADMAP.md
+++ b/docs/roadmaps/not-started/platform_ENTERPRISE_PROVISIONING_SCIM_ROADMAP.md
@ -0,0 +1,81 @@
+# Enterprise Provisioning & SCIM Roadmap
+
+> **Purpose:** Extend enterprise identity from federation-only to full lifecycle
+> provisioning, deprovisioning, group sync, and seat governance.
+>
+> **Primary Surface:** `services/platform-service/src/modules/auth/enterprise/`
+>
+> **Status:** Planned
+>
+> **Estimated Effort:** 2-3 weeks
+
+---
+
+## Why This Is Missing
+
+The platform already has enterprise SAML and OIDC federation. That solves login.
+It does not solve enterprise lifecycle management:
+
+- just-in-time user provisioning policies
+- SCIM user sync
+- group sync
+- deprovisioning
+- seat and entitlement mapping
+
+---
+
+## Recommended Stack
+
+- Extend `platform-service` enterprise auth
+- SCIM 2.0 endpoints in Fastify
+- org/workspace mapping from the tenant model
+- optional background sync jobs using `@bytelyst/queue`
+
+---
+
+## Phase 1 - SCIM Foundations
+
+- [ ] Add SCIM service provider config endpoint
+- [ ] Add SCIM resource schemas
+- [ ] Add endpoints for:
+  - [ ] `/scim/v2/Users`
+  - [ ] `/scim/v2/Groups`
+  - [ ] `PATCH` operations for partial updates
+  - [ ] user deactivation
+- [ ] Add enterprise API tokens and audit logs
+
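Deactivation in SCIM 2.0 is commonly expressed as a `PATCH` that replaces the `active` attribute. The sketch below handles only that one case, assuming a boolean `value` and a simple `active` path; the `ScimUser` shape is trimmed down and `applyActivePatch` is a hypothetical helper, not a full PatchOp implementation.

```typescript
// Message schema URN for SCIM 2.0 PATCH requests.
const PATCH_OP_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp";

interface ScimUser {
  id: string;
  userName: string;
  active: boolean;
}

interface ScimPatchOperation {
  op: string;
  path?: string;
  value?: unknown;
}

interface ScimPatchRequest {
  schemas: string[];
  Operations: ScimPatchOperation[];
}

// Applies only `replace` operations on `active`; other operations are ignored in this sketch.
function applyActivePatch(user: ScimUser, request: ScimPatchRequest): ScimUser {
  if (request.schemas.indexOf(PATCH_OP_SCHEMA) === -1) {
    throw new Error("Not a SCIM PatchOp request");
  }
  let active = user.active;
  for (const operation of request.Operations) {
    // Case is normalized defensively, since clients may differ in how they capitalize `op`.
    const isReplace = operation.op.toLowerCase() === "replace";
    if (isReplace && operation.path === "active" && typeof operation.value === "boolean") {
      active = operation.value;
    }
  }
  return { id: user.id, userName: user.userName, active: active };
}
```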
+---
+
+## Phase 2 - Provisioning Rules
+
+- [ ] Map SCIM users to org/workspace memberships
+- [ ] Map groups to roles or teams
+- [ ] Support seat assignment and revocation
+- [ ] Add deprovision grace policy
+
+---
+
+## Phase 3 - Admin Controls
+
+- [ ] Admin UI for provisioning state and sync errors
+- [ ] Reconciliation jobs
+- [ ] Audit exports
+- [ ] Break-glass override flows
+
+---
+
+## Tech Stack Options
+
+| Option                          | Pros                                | Cons                                 | Fit                               |
+| ------------------------------- | ----------------------------------- | ------------------------------------ | --------------------------------- |
+| Native SCIM in platform-service | Full control, strong enterprise fit | Must implement spec carefully        | Best long-term                    |
+| IdP proxy product               | Faster setup                        | External dependency and less control | Acceptable only if needed quickly |
+| JIT only                        | Minimal effort                      | Not enough for enterprise IT         | Inadequate                        |
+
+---
+
+## Risks
+
+- Enterprise login without enterprise provisioning still creates admin pain
+- Group mapping drift leads to incorrect access
+- Deprovision lag is a real security risk
--- a/docs/roadmaps/not-started/platform_HUMAN_REVIEW_APPROVAL_QUEUE_ROADMAP.md
+++ b/docs/roadmaps/not-started/platform_HUMAN_REVIEW_APPROVAL_QUEUE_ROADMAP.md
@ -0,0 +1,93 @@
+# Human Review & Approval Queue Roadmap
+
+> **Purpose:** Add a generic human-in-the-loop system for agent actions,
+> escalations, approvals, and quality review.
+>
+> **Primary Surface:** `services/platform-service/`
+>
+> **Status:** Planned
+>
+> **Estimated Effort:** 2-3 weeks
+
+---
+
+## Why This Is Missing
+
+The platform has MFA push approvals, but that is a narrow auth flow. An agent company
+also needs a generic review queue for cases like:
+
+- send this message
+- execute this external action
+- publish this recommendation
+- approve this prompt change
+- inspect low-confidence output
+
+---
+
+## Recommended Stack
+
+- `platform-service` review module
+- `@bytelyst/queue` for routing and escalation timers
+- Slack and Telegram delivery adapters for reviewer notifications
+- Optional policy engine later with OpenFGA or Cedar
+
+---
+
+## Phase 1 - Review Objects
+
+- [ ] Create modules:
+  - [ ] `reviews`
+  - [ ] `approvals`
+  - [ ] `escalations`
+- [ ] Define review object fields:
+  - [ ] subject type
+  - [ ] subject ID
+  - [ ] review reason
+  - [ ] risk level
+  - [ ] required decision type
+  - [ ] assigned reviewer(s)
+  - [ ] SLA and due time
+  - [ ] supporting evidence
+- [ ] Add states:
+  - [ ] pending
+  - [ ] claimed
+  - [ ] approved
+  - [ ] rejected
+  - [ ] expired
+  - [ ] superseded
+
+---
+
+## Phase 2 - Workflow Integration
+
+- [ ] Allow agent runs to emit `waiting_for_review`
+- [ ] Add review decision callbacks to resume or cancel runs
+- [ ] Add escalation timers and reassignment
+- [ ] Add reviewer comments and audit trail
+
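The decision callback described above could be sketched as follows: a review is claimed, then decided, and the decision maps to a run action. Mapping `approved` to resume and `rejected` to cancel is an assumption for this example, as are the trimmed-down `ReviewItem` fields and the function names. Each step appends to an audit trail.

```typescript
// Review states are from Phase 1; the ReviewItem shape is a reduced, hypothetical version.
type ReviewState = "pending" | "claimed" | "approved" | "rejected" | "expired" | "superseded";
type RunAction = "resume" | "cancel";

interface ReviewItem {
  id: string;
  subjectRunId: string; // the agent run waiting on this review
  state: ReviewState;
  reviewerId: string | null;
  auditTrail: string[];
}

interface ReviewOutcome {
  item: ReviewItem;
  runAction: RunAction;
}

function claimReview(item: ReviewItem, reviewerId: string): ReviewItem {
  if (item.state !== "pending") throw new Error("Only pending reviews can be claimed");
  return {
    id: item.id,
    subjectRunId: item.subjectRunId,
    state: "claimed",
    reviewerId: reviewerId,
    auditTrail: item.auditTrail.concat(["claimed by " + reviewerId]),
  };
}

// Assumed mapping: approval resumes the waiting run, rejection cancels it.
function decideReview(item: ReviewItem, decision: "approved" | "rejected", comment: string): ReviewOutcome {
  if (item.state !== "claimed") throw new Error("Only claimed reviews can be decided");
  const updated: ReviewItem = {
    id: item.id,
    subjectRunId: item.subjectRunId,
    state: decision,
    reviewerId: item.reviewerId,
    auditTrail: item.auditTrail.concat([decision + ": " + comment]),
  };
  return { item: updated, runAction: decision === "approved" ? "resume" : "cancel" };
}
```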
+---
+
+## Phase 3 - Reviewer Experience
+
+- [ ] API and admin UI queue
+- [ ] Bulk claim and assignment
+- [ ] Notification fan-out via Slack/Telegram/email
+- [ ] Filters by risk, workspace, agent, age, reviewer
+
+---
+
+## Tech Stack Options
+
+| Option                                  | Pros                       | Cons                                           | Fit                 |
+| --------------------------------------- | -------------------------- | ---------------------------------------------- | ------------------- |
+| Platform module + queue + notifications | Simple and aligned to repo | More UI to build                               | Best immediate path |
+| Commercial ticketing/workflow tool      | Fast start                 | External dependency and poor control-plane fit | Poor long-term      |
+| Dedicated BPM engine                    | Powerful                   | Too heavy for initial need                     | Overkill initially  |
+
+---
+
+## Risks
+
+- If approvals are only implemented ad hoc per module, policy becomes inconsistent
+- If decisions are not audit logged, enterprise trust will be weak
+- Review queues without SLA and ownership become dead-letter inboxes
--- a/docs/roadmaps/not-started/platform_KNOWLEDGE_RAG_SERVICE_ROADMAP.md
+++ b/docs/roadmaps/not-started/platform_KNOWLEDGE_RAG_SERVICE_ROADMAP.md
@ -0,0 +1,117 @@
+# Knowledge & RAG Service Roadmap
+
+> **Purpose:** Build a shared knowledge platform for ingestion, chunking,
+> embeddings, retrieval, citations, and access-controlled context assembly.
+>
+> **Primary Surfaces:** `services/platform-service/`, `services/extraction-service/`
+>
+> **Status:** Planned
+>
+> **Estimated Effort:** 4-6 weeks
+
+---
+
+## Why This Is Missing
+
+The repo already has extraction and some vector-based diagnostics work, but there is
+no reusable platform service for general retrieval-augmented generation across
+products and agents.
+
+Every serious agent company eventually needs:
+
+- managed document ingestion
+- chunking and metadata
+- embeddings
+- retrieval APIs
+- citations and provenance
+- workspace-aware access control
+
+---
+
+## Recommended Stack
+
+### Best Long-Term Industry Standard
+
+- **PostgreSQL + pgvector** for integrated metadata + vector search
+- **Qdrant** if vector-first performance becomes dominant
+- **Blob storage** for source documents
+
+### Cloud-Native Alternative
+
+- **Azure AI Search** for retrieval
+- Cosmos or Postgres for metadata
+
+### Recommendation
+
+Use PostgreSQL + pgvector if you want the strongest balance of flexibility,
+ownership, and industry-standard retrieval patterns. Azure AI Search is a valid
+alternative if deep Azure integration matters more than datastore simplicity.
+
+---
+
+## Phase 1 - Knowledge Objects
+
+- [ ] Create modules:
+  - [ ] `knowledge-sources`
+  - [ ] `knowledge-documents`
+  - [ ] `knowledge-chunks`
+  - [ ] `knowledge-indexes`
+- [ ] Add ingestion states:
+  - [ ] uploaded
+  - [ ] parsed
+  - [ ] chunked
+  - [ ] embedded
+  - [ ] indexed
+  - [ ] failed
+- [ ] Add source provenance:
+  - [ ] filename
+  - [ ] URL
+  - [ ] connector type
+  - [ ] page or section references
+
+---
+
+## Phase 2 - Retrieval Pipeline
+
+- [ ] Add chunking service with configurable strategies
+- [ ] Add embedding generation pipeline
+- [ ] Add hybrid search:
+  - [ ] lexical
+  - [ ] vector
+  - [ ] metadata filters
+- [ ] Add citation builder and quote bounds
+- [ ] Add workspace and org scoping
+
+**Acceptance Criteria**
+
+- Retrieval returns chunks with citations and permission-safe metadata
+- Different products can share the same retrieval API
+
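Hybrid search with workspace scoping can be sketched by fusing a lexical ranking and a vector ranking, then filtering to the caller's workspace. The roadmap does not name a fusion method; reciprocal rank fusion is swapped in here as one common choice, with the conventional constant `k = 60`. The `Chunk` shape and function names are hypothetical, and the two rankings are assumed to be produced elsewhere.

```typescript
interface Chunk {
  id: string;
  workspaceId: string;
  text: string;
  source: string; // provenance used for citations
}

// Reciprocal rank fusion: score(id) = sum over rankings of 1 / (k + rank), rank starting at 1.
function fuseRankings(rankings: string[][], k: number): { [id: string]: number } {
  const scores: { [id: string]: number } = {};
  for (const ranking of rankings) {
    for (let i = 0; i < ranking.length; i++) {
      const id = ranking[i];
      scores[id] = (scores[id] || 0) + 1 / (k + i + 1);
    }
  }
  return scores;
}

// The workspace filter is applied before results leave the service, never by the caller.
function hybridSearch(
  chunks: Chunk[],
  lexicalRanking: string[],
  vectorRanking: string[],
  workspaceId: string,
  limit: number
): Chunk[] {
  const scores = fuseRankings([lexicalRanking, vectorRanking], 60);
  return chunks
    .filter(function (c) { return c.workspaceId === workspaceId && scores[c.id] !== undefined; })
    .sort(function (a, b) { return scores[b.id] - scores[a.id]; })
    .slice(0, limit);
}
```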
+---
+
+## Phase 3 - Connectors
+
+- [ ] File upload
+- [ ] Web page ingestion
+- [ ] Notes/workspace connector
+- [ ] Blob-backed ingestion
+- [ ] Optional Slack/Confluence/Google Drive connectors
+
+---
+
+## Tech Stack Options
+
+| Option                   | Pros                                       | Cons                         | Fit                              |
+| ------------------------ | ------------------------------------------ | ---------------------------- | -------------------------------- |
+| Postgres + pgvector      | Strong standard, unified metadata + vector | Requires new datastore       | Best overall                     |
+| Qdrant + metadata DB     | Great vector performance                   | Two systems to operate       | Good at scale                    |
+| Azure AI Search          | Strong managed search                      | Tighter vendor coupling      | Best Azure-managed option        |
+| Cosmos vector workaround | Least disruption                           | Not ideal as main RAG engine | Avoid as primary long-term stack |
+
+---
+
+## Risks
+
+- Retrieval without access control causes data leakage between tenants
+- Retrieval without citations causes trust and compliance issues
+- Embeddings without source lifecycle management become stale quickly
--- a/docs/roadmaps/not-started/platform_ORG_WORKSPACE_RBAC_ROADMAP.md
+++ b/docs/roadmaps/not-started/platform_ORG_WORKSPACE_RBAC_ROADMAP.md
@ -0,0 +1,122 @@
+# Org, Workspace & RBAC Roadmap
+
+> **Purpose:** Add a first-class tenant model for organizations, workspaces,
+> teams, memberships, scoped roles, and admin governance.
+>
+> **Primary Surface:** `services/platform-service/`
+>
+> **Status:** Planned
+>
+> **Estimated Effort:** 3-4 weeks
+
+---
+
+## Why This Is Missing
+
+The platform has users and per-product memberships, but no canonical model for:
+
+- organizations
+- workspaces
+- teams
+- workspace-scoped roles
+- resource ownership and sharing
+
+Enterprise IdP support exists, but it does not replace a real tenant model.
+
+---
+
+## Recommended Stack
+
+### Best Long-Term Industry Standard
+
+- **PostgreSQL**
+- **Drizzle ORM** or **Prisma**
+- **OpenFGA** or Zanzibar-style authorization model for fine-grained access
+
+### Best Repo-Fit Option
+
+- `platform-service` module set backed by Cosmos
+- Role and membership evaluation in service code
+- Optional policy layer later using OpenFGA
+
+### Recommendation
+
+If tenanting will be central to the business, PostgreSQL is the better long-term
+fit because org/workspace membership is relational by nature. If short-term
+consistency with the existing Cosmos-backed modules matters more, start in Cosmos
+but keep the permission model portable.
+
+---
+
+## Phase 1 - Data Model
+
+- [ ] Create modules:
+  - [ ] `orgs`
+  - [ ] `workspaces`
+  - [ ] `teams`
+  - [ ] `memberships`
+  - [ ] `roles`
+- [ ] Define resources:
+  - [ ] organization
+  - [ ] workspace
+  - [ ] team
+  - [ ] service account
+  - [ ] API key
+- [ ] Define roles:
+  - [ ] `org_owner`
+  - [ ] `org_admin`
+  - [ ] `workspace_admin`
+  - [ ] `workspace_editor`
+  - [ ] `workspace_viewer`
+  - [ ] `support_operator`
+- [ ] Add invitation and deprovision flows
+
+**Acceptance Criteria**
+
+- Every protected resource can be tied to org/workspace ownership
+- Users can belong to multiple workspaces with different roles
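+
+A minimal sketch of how memberships might be shaped so that one user can hold
+different roles in different workspaces. The role names come from the list above;
+the `Membership` shape and the `rolesFor` helper are hypothetical and do not exist
+in the repo.
+
+```typescript
+// Hypothetical shapes for illustration only.
+type Role =
+  | "org_owner"
+  | "org_admin"
+  | "workspace_admin"
+  | "workspace_editor"
+  | "workspace_viewer"
+  | "support_operator";
+
+interface Membership {
+  userId: string;
+  orgId: string;
+  workspaceId?: string; // absent for org-level memberships
+  role: Role;
+}
+
+// Collect every role a user holds that applies inside one workspace:
+// memberships scoped to that workspace plus org-level memberships
+// in the owning organization.
+function rolesFor(
+  memberships: readonly Membership[],
+  userId: string,
+  orgId: string,
+  workspaceId: string,
+): Role[] {
+  return memberships
+    .filter((m) => m.userId === userId && m.orgId === orgId)
+    .filter((m) => m.workspaceId === undefined || m.workspaceId === workspaceId)
+    .map((m) => m.role);
+}
+```
+
+Keeping `workspaceId` optional is one way to store org-level and workspace-level
+roles in the same collection or table, which may help keep the model portable
+between Cosmos and PostgreSQL.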
+
+---
+
+## Phase 2 - Authorization
+
+- [ ] Add authorization helpers to `@bytelyst/auth` or a new `@bytelyst/authorization`
+- [ ] Evaluate permissions by resource and action
+- [ ] Add policy checks for:
+  - [ ] read
+  - [ ] write
+  - [ ] execute
+  - [ ] approve
+  - [ ] administer
+- [ ] Add service account and API key scopes
+
+**Acceptance Criteria**
+
+- Endpoints no longer rely only on a flat `admin` vs `user` split
+- Policies are testable and reusable across modules
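+
+One way the "testable and reusable" criterion could look in service code is a pure
+function over a role-to-action table. This is a sketch, not the `@bytelyst/auth`
+API: the `GRANTS` matrix and the `can` helper are hypothetical, and the specific
+grants shown are placeholder assumptions rather than decided policy.
+
+```typescript
+type Action = "read" | "write" | "execute" | "approve" | "administer";
+
+type Role =
+  | "org_owner"
+  | "org_admin"
+  | "workspace_admin"
+  | "workspace_editor"
+  | "workspace_viewer"
+  | "support_operator";
+
+const ALL_ACTIONS: readonly Action[] = ["read", "write", "execute", "approve", "administer"];
+
+// Illustrative grants only; the real matrix is a product decision.
+const GRANTS: Record<Role, readonly Action[]> = {
+  org_owner: ALL_ACTIONS,
+  org_admin: ALL_ACTIONS,
+  workspace_admin: ALL_ACTIONS,
+  workspace_editor: ["read", "write", "execute"],
+  workspace_viewer: ["read"],
+  support_operator: ["read"],
+};
+
+// True if any of the caller's roles grants the requested action.
+function can(roles: readonly Role[], action: Action): boolean {
+  return roles.some((role) => GRANTS[role].includes(action));
+}
+```
+
+Because `can` has no I/O, it can be unit-tested per role and reused by any module.
+A later move to OpenFGA would replace the table lookup while keeping the call shape.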
+
+---
+
+## Phase 3 - Product Integration
+
+- [ ] Migrate existing modules that should be workspace-scoped
+- [ ] Add workspace headers or explicit route scoping
+- [ ] Connect enterprise IdP claims to org/workspace resolution
+- [ ] Add audit entries for membership and role changes
+
+---
+
+## Tech Stack Options
+
+| Option                      | Pros                                | Cons                             | Fit                    |
+| --------------------------- | ----------------------------------- | -------------------------------- | ---------------------- |
+| PostgreSQL + OpenFGA        | Best long-term for RBAC and sharing | New datastore + auth layer       | Best industry-standard |
+| PostgreSQL only             | Simpler than OpenFGA, still strong  | Fine-grained auth is custom code | Good medium path       |
+| Cosmos + service-level RBAC | Lowest disruption                   | Harder joins, limited policies   | Good short-term        |
+
+---
+
+## Risks
+
+- Flat roles will become a blocker for enterprise customers and multi-agent collaboration
+- Delaying workspace boundaries causes later data migrations
+- Fine-grained sharing is hard to retrofit once data models hardcode user ownership
--- a/docs/roadmaps/not-started/platform_SUPPORT_CASE_MANAGEMENT_ROADMAP.md
+++ b/docs/roadmaps/not-started/platform_SUPPORT_CASE_MANAGEMENT_ROADMAP.md
@ -0,0 +1,87 @@
+# Support Case Management Roadmap
+
+> **Purpose:** Build a platform-native case system for customer issues, agent
+> escalations, internal triage, and resolution tracking.
+>
+> **Primary Surfaces:** `services/platform-service/`, `services/mcp-server/`
+>
+> **Status:** Planned
+>
+> **Estimated Effort:** 3-4 weeks
+
+---
+
+## Why This Is Missing
+
+The repo has diagnostics, telemetry, debug tooling, and support-oriented MCP helpers.
+What it lacks is a canonical case record that ties them together.
+
+Without a case system, support work becomes fragmented across logs, chat, and ad hoc notes.
+
+---
+
+## Recommended Stack
+
+- `platform-service` case module
+- Cosmos or PostgreSQL for case records
+- Blob storage for attachments and debug packs
+- Notification hooks to Slack/Telegram/email
+
+### Recommendation
+
+This can live comfortably in `platform-service`. If the case domain becomes highly
+relational, PostgreSQL is better. Otherwise a Cosmos-backed module is acceptable.
+
+---
+
+## Phase 1 - Core Case Model
+
+- [ ] Create modules:
+  - [ ] `cases`
+  - [ ] `case-comments`
+  - [ ] `case-attachments`
+  - [ ] `case-links`
+- [ ] Track:
+  - [ ] customer or workspace
+  - [ ] severity
+  - [ ] status
+  - [ ] assignee
+  - [ ] linked run IDs
+  - [ ] linked diagnostics sessions
+  - [ ] linked incidents and releases
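+
+The tracked fields above could be captured in a single record type. The sketch
+below is an assumption-laden illustration: the `CaseRecord` shape, the severity
+and status values, and the `linkRun` helper are all hypothetical.
+
+```typescript
+// Placeholder enumerations; the roadmap does not fix these values.
+type Severity = "low" | "medium" | "high" | "critical";
+type CaseStatus = "open" | "triaged" | "in_progress" | "resolved" | "closed";
+
+interface CaseRecord {
+  id: string;
+  workspaceId: string; // could be a customer ID instead
+  severity: Severity;
+  status: CaseStatus;
+  assigneeId?: string; // unset until the case is assigned
+  runIds: string[];
+  diagnosticsSessionIds: string[];
+  incidentIds: string[];
+  releaseIds: string[];
+}
+
+// Link a run to a case without mutating the input; linking twice is a no-op.
+function linkRun(record: CaseRecord, runId: string): CaseRecord {
+  if (record.runIds.includes(runId)) {
+    return record;
+  }
+  return { ...record, runIds: [...record.runIds, runId] };
+}
+```
+
+Storing links as ID arrays fits a document store such as Cosmos. In PostgreSQL the
+same links would more naturally live in the `case-links` module as join rows.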
+
+---
+
+## Phase 2 - Operational Workflow
+
+- [ ] Add triage statuses and SLA timers
+- [ ] Add handoff between support, engineering, and operations
+- [ ] Add debug-pack ingestion
+- [ ] Add incident and case cross-links
+- [ ] Add case templates for common issue categories
+
+---
+
+## Phase 3 - Agent Integration
+
+- [ ] Let agents create draft cases from failed or escalated runs
+- [ ] Let support operators ask MCP tools for case-linked diagnostics
+- [ ] Add case summarization and next-step suggestions
+
+---
+
+## Tech Stack Options
+
+| Option                      | Pros                          | Cons                            | Fit                   |
+| --------------------------- | ----------------------------- | ------------------------------- | --------------------- |
+| Platform-native case module | Full control, integrates well | More work up front              | Best long-term        |
+| External helpdesk sync      | Faster bootstrap              | Split system of record          | Good only if required |
+| Ticket tool only            | Lowest build effort           | Weak agent-platform integration | Poor strategic fit    |
+
+---
+
+## Risks
+
+- No unified case object means poor support analytics and weak escalations
+- External-only support systems hide key agent and diagnostics context
+- If cases cannot link to runs and review queues, operators lose causal context