docs(roadmaps): add agent platform gap roadmap set
parent
8ad3e1be34
commit
d4c725a29d
@ -0,0 +1,109 @@
|
||||
# Agent Platform Gaps - Roadmap Index
|
||||
|
||||
> **Purpose:** Turn the current agent-company platform gaps into an actionable roadmap set.
|
||||
>
|
||||
> **Scope:** `learning_ai_common_plat`
|
||||
>
|
||||
> **Date:** 2026-03-14
|
||||
>
|
||||
> **Status:** Planned
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
The shared platform already covers a large amount of generic SaaS infrastructure:
|
||||
auth, telemetry, diagnostics, flags, delivery, jobs, marketplace, billing-related
|
||||
modules, extraction, MCP tooling, and durable queue primitives.
|
||||
|
||||
What is still missing is the **agent control plane**:
|
||||
|
||||
1. durable agent run orchestration
|
||||
2. org/workspace/team/RBAC
|
||||
3. agent registry and prompt versioning
|
||||
4. reusable knowledge/RAG
|
||||
5. human review and approval queue
|
||||
6. support case management
|
||||
7. durable cross-service eventing and worker runtime
|
||||
8. centralized AI governance and evals
|
||||
9. AI budget and cost governance
|
||||
10. enterprise provisioning and SCIM
|
||||
|
||||
This roadmap set breaks those gaps into separate implementation documents so they
|
||||
can be sequenced without mixing concerns.
|
||||
|
||||
---
|
||||
|
||||
## Roadmap Set
|
||||
|
||||
1. [Agent Runtime & Orchestration Roadmap](./platform_AGENT_RUNTIME_ORCHESTRATION_ROADMAP.md)
|
||||
2. [Org, Workspace & RBAC Roadmap](./platform_ORG_WORKSPACE_RBAC_ROADMAP.md)
|
||||
3. [Agent Registry & Prompt Versioning Roadmap](./platform_AGENT_REGISTRY_PROMPT_VERSIONING_ROADMAP.md)
|
||||
4. [Knowledge & RAG Service Roadmap](./platform_KNOWLEDGE_RAG_SERVICE_ROADMAP.md)
|
||||
5. [Human Review & Approval Queue Roadmap](./platform_HUMAN_REVIEW_APPROVAL_QUEUE_ROADMAP.md)
|
||||
6. [Support Case Management Roadmap](./platform_SUPPORT_CASE_MANAGEMENT_ROADMAP.md)
|
||||
7. [Durable Event Bus & Worker Runtime Roadmap](./platform_DURABLE_EVENT_BUS_AND_WORKER_RUNTIME_ROADMAP.md)
|
||||
8. [AI Governance & Evaluation Roadmap](./platform_AI_GOVERNANCE_EVALS_ROADMAP.md)
|
||||
9. [AI Budget & Cost Governance Roadmap](./platform_AI_BUDGET_COST_GOVERNANCE_ROADMAP.md)
|
||||
10. [Enterprise Provisioning & SCIM Roadmap](./platform_ENTERPRISE_PROVISIONING_SCIM_ROADMAP.md)
|
||||
|
||||
---
|
||||
|
||||
## Existing Repo Signals
|
||||
|
||||
These gaps are not theoretical. The current codebase already shows the partial
|
||||
foundations and the missing layers:
|
||||
|
||||
- Durable queue primitives now exist in `packages/queue/`, but agent orchestration in
|
||||
`services/mcp-server/src/modules/a2a/runner.ts` is still primarily log-driven.
|
||||
- `platform-service` has broad product infrastructure, but there is no first-class
|
||||
org/workspace/team module under `services/platform-service/src/modules/`.
|
||||
- Enterprise IdP support exists in `services/platform-service/src/modules/auth/enterprise/`,
|
||||
but enterprise provisioning does not.
|
||||
- The event bus in `packages/events/src/memory.ts` is in-process only.
|
||||
- `ai-diagnostics` already uses embeddings and vector similarity, but there is no reusable
|
||||
knowledge service for general agent retrieval.
|
||||
- MFA push approvals exist, but there is no general review queue for agent actions.
|
||||
|
||||
---
|
||||
|
||||
## Recommended Build Order
|
||||
|
||||
### P0
|
||||
|
||||
1. Agent Runtime & Orchestration
|
||||
2. Durable Event Bus & Worker Runtime
|
||||
3. Org, Workspace & RBAC
|
||||
4. Human Review & Approval Queue
|
||||
|
||||
### P1
|
||||
|
||||
5. Agent Registry & Prompt Versioning
|
||||
6. Knowledge & RAG Service
|
||||
7. AI Budget & Cost Governance
|
||||
8. AI Governance & Evaluation
|
||||
|
||||
### P2
|
||||
|
||||
9. Enterprise Provisioning & SCIM
|
||||
10. Support Case Management
|
||||
|
||||
---
|
||||
|
||||
## Architectural Guidance
|
||||
|
||||
These docs assume the current repo direction remains:
|
||||
|
||||
- TypeScript + Fastify services
|
||||
- shared `@bytelyst/*` packages
|
||||
- `platform-service` as control plane
|
||||
- `mcp-server` as operator and A2A interface
|
||||
|
||||
However, some missing capabilities are more naturally relational or workflow-heavy
|
||||
than the current Cosmos-first platform modules. Each roadmap therefore includes:
|
||||
|
||||
- a **recommended stack** for long-term quality
|
||||
- a **repo-fit alternative** that stays closer to current conventions
|
||||
|
||||
That is intentional. The best industry-standard choice is not always the same as
|
||||
the least disruptive repo-local choice.
|
||||
@ -0,0 +1,94 @@
|
||||
# Agent Registry & Prompt Versioning Roadmap
|
||||
|
||||
> **Purpose:** Create a system of record for agents, prompts, tools, versions,
|
||||
> rollout states, and release governance.
|
||||
>
|
||||
> **Primary Surfaces:** `services/platform-service/`, `services/mcp-server/`
|
||||
>
|
||||
> **Status:** Planned
|
||||
>
|
||||
> **Estimated Effort:** 2-3 weeks
|
||||
|
||||
---
|
||||
|
||||
## Why This Is Missing
|
||||
|
||||
The repo has MCP tools and A2A pipelines, but it does not have a persistent registry
|
||||
for the definitions that power them. Without that, agent behavior is embedded in code
|
||||
and docs rather than treated as versioned platform data.
|
||||
|
||||
---
|
||||
|
||||
## Recommended Stack
|
||||
|
||||
- **PostgreSQL** for metadata and version history
|
||||
- **Blob storage or Git-backed artifacts** for prompt files and larger assets
|
||||
- **OpenTelemetry** for linking versions to production runs
|
||||
|
||||
### Repo-Fit Alternative
|
||||
|
||||
- Cosmos-backed registry module in `platform-service`
|
||||
- Prompt artifacts stored in blob storage
|
||||
- MCP server resolves active versions from `platform-service`
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 - Core Registry
|
||||
|
||||
- [ ] Create modules:
  - [ ] `agent-registry`
  - [ ] `prompt-registry`
  - [ ] `tool-bundles`
- [ ] Add entities:
  - [ ] `AgentDefinition`
  - [ ] `AgentVersion`
  - [ ] `PromptTemplate`
  - [ ] `PromptVersion`
  - [ ] `ToolBundle`
  - [ ] `ReleaseChannel`
- [ ] Track:
  - [ ] owner
  - [ ] changelog
  - [ ] status: `draft`, `staged`, `active`, `deprecated`, `archived`
  - [ ] compatibility constraints
|
||||
|
||||
**Acceptance Criteria**
|
||||
|
||||
- Every production agent has a durable version record
|
||||
- Prompt changes are diffable and auditable
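
A minimal sketch of how the version record and its status lifecycle could be typed. The status values come from the Phase 1 list; the field names (`id`, `agentId`, `promptVersionId`) and the allowed-transition table are assumptions for illustration, not a confirmed schema.

```typescript
// Status values come from the roadmap; the remaining field names are illustrative.
type VersionStatus = "draft" | "staged" | "active" | "deprecated" | "archived";

interface AgentVersion {
  id: string; // hypothetical identifier
  agentId: string; // hypothetical link to an AgentDefinition
  promptVersionId: string; // hypothetical link to a PromptVersion
  owner: string;
  changelog: string[];
  status: VersionStatus;
}

// Assumed lifecycle: archived is terminal, and only staged versions can go active.
const ALLOWED_TRANSITIONS: Record<VersionStatus, VersionStatus[]> = {
  draft: ["staged", "archived"],
  staged: ["draft", "active", "archived"],
  active: ["deprecated"],
  deprecated: ["archived"],
  archived: [],
};

function canTransition(from: VersionStatus, to: VersionStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}

// Returns a new record instead of mutating, so earlier states stay auditable.
function changeStatus(version: AgentVersion, to: VersionStatus, note: string): AgentVersion {
  if (!canTransition(version.status, to)) {
    throw new Error(`Illegal status change: ${version.status} -> ${to}`);
  }
  return { ...version, status: to, changelog: [...version.changelog, note] };
}
```

Keeping the transition table as data rather than scattered `if` checks makes it easier to audit which promotions the registry permits.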
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 - Rollouts & Safety
|
||||
|
||||
- [ ] Add staged rollouts by product, org, workspace, or cohort
|
||||
- [ ] Add allowlists and freeze controls
|
||||
- [ ] Add prompt approval requirement for sensitive agents
|
||||
- [ ] Add rollback support
|
||||
- [ ] Link agent versions to eval results and incidents
|
||||
|
||||
---
|
||||
|
||||
## Phase 3 - Runtime Integration
|
||||
|
||||
- [ ] `mcp-server` loads active definitions from registry rather than code-only defaults
|
||||
- [ ] `agent-runs` store `agentVersion` and `promptVersion`
|
||||
- [ ] Support replay against older versions for regression analysis
|
||||
|
||||
---
|
||||
|
||||
## Tech Stack Options
|
||||
|
||||
| Option                           | Pros                       | Cons                           | Fit                         |
| -------------------------------- | -------------------------- | ------------------------------ | --------------------------- |
| PostgreSQL + blob storage        | Strong relational history  | New datastore                  | Best long-term              |
| Git as source of truth + sync DB | Great developer ergonomics | Dual-source complexity         | Good for prompt-heavy teams |
| Cosmos + blob storage            | Consistent with repo       | Version queries less ergonomic | Good short-term             |
|
||||
|
||||
---
|
||||
|
||||
## Risks
|
||||
|
||||
- Code-only prompt management creates invisible production drift
|
||||
- Without version pinning, incident replay and audit are weak
|
||||
- Registry without rollout controls is just a metadata catalog
|
||||
@ -0,0 +1,146 @@
|
||||
# Agent Runtime & Orchestration Roadmap
|
||||
|
||||
> **Purpose:** Build a durable execution layer for agent runs, step transitions,
|
||||
> cancellations, retries, resumability, and operator-visible history.
|
||||
>
|
||||
> **Primary Surfaces:** `services/platform-service/`, `services/mcp-server/`,
|
||||
> `packages/queue/`
|
||||
>
|
||||
> **Status:** Planned
|
||||
>
|
||||
> **Estimated Effort:** 3-5 weeks
|
||||
|
||||
---
|
||||
|
||||
## Why This Is Missing
|
||||
|
||||
The repo now has durable queue primitives, but agent execution is still not a
|
||||
first-class platform service. A2A pipelines in `services/mcp-server/src/modules/a2a/`
|
||||
are composed code paths rather than durable runs with persistent step state.
|
||||
|
||||
That is enough for prototypes. It is not enough for:
|
||||
|
||||
- long-running multi-step agents
|
||||
- retries after process restarts
|
||||
- human escalation in the middle of a run
|
||||
- cancellation and pause/resume
|
||||
- auditable run history
|
||||
|
||||
---
|
||||
|
||||
## Recommended Stack
|
||||
|
||||
### Best Long-Term Industry Standard
|
||||
|
||||
- **Temporal** for workflow orchestration
|
||||
- **PostgreSQL** for run metadata and operator queries
|
||||
- **Redis** for short-lived coordination and cache
|
||||
|
||||
### Best Repo-Fit Option
|
||||
|
||||
- `@bytelyst/queue` for durable job dispatch
|
||||
- `platform-service` run records stored in Cosmos or behind a datastore abstraction
|
||||
- `mcp-server` as orchestration client and tool executor
|
||||
|
||||
### Recommendation
|
||||
|
||||
Start with the repo-fit option to get durable runs quickly, but design the run model
|
||||
so a later move to Temporal is possible without rewriting every agent contract.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 - Canonical Run Model
|
||||
|
||||
- [ ] Create `services/platform-service/src/modules/agent-runs/`
|
||||
- [ ] Define `AgentRunDoc`, `AgentRunStepDoc`, `AgentRunEventDoc`
|
||||
- [ ] Support states: `queued`, `running`, `waiting_for_input`, `paused`, `succeeded`, `failed`, `cancelled`
|
||||
- [ ] Add `parentRunId`, `workflowId`, `agentId`, `agentVersion`, `triggerSource`
|
||||
- [ ] Persist step inputs, outputs, timings, error summaries, and correlation IDs
|
||||
- [ ] Add APIs:
  - [ ] `POST /agent-runs`
  - [ ] `GET /agent-runs/:id`
  - [ ] `GET /agent-runs/:id/events`
  - [ ] `POST /agent-runs/:id/cancel`
  - [ ] `POST /agent-runs/:id/pause`
  - [ ] `POST /agent-runs/:id/resume`
|
||||
|
||||
**Acceptance Criteria**
|
||||
|
||||
- Every agent run has durable metadata and step history
|
||||
- A run can be fetched after service restart
|
||||
- Cancellation and pause are explicit states, not implicit errors
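
One way to make cancellation and pause explicit states is a small transition table that the cancel, pause, and resume endpoints delegate to. The sketch below uses the state names and run fields listed in Phase 1; the transition rules, the `updatedAt` field, and the helper names are assumptions, not the actual `AgentRunDoc` definition.

```typescript
// States are taken from the roadmap; the transition table is an assumption.
type RunState =
  | "queued"
  | "running"
  | "waiting_for_input"
  | "paused"
  | "succeeded"
  | "failed"
  | "cancelled";

interface AgentRunDoc {
  id: string;
  workflowId: string;
  agentId: string;
  agentVersion: string;
  triggerSource: string;
  parentRunId?: string;
  state: RunState;
  updatedAt: string; // ISO-8601 timestamp (hypothetical field)
}

const NEXT_STATES: Record<RunState, RunState[]> = {
  queued: ["running", "cancelled"],
  running: ["waiting_for_input", "paused", "succeeded", "failed", "cancelled"],
  waiting_for_input: ["running", "cancelled"],
  paused: ["running", "cancelled"],
  succeeded: [],
  failed: [],
  cancelled: [],
};

function moveTo(run: AgentRunDoc, to: RunState, now: Date = new Date()): AgentRunDoc {
  if (!NEXT_STATES[run.state].includes(to)) {
    throw new Error(`Run ${run.id}: cannot move from ${run.state} to ${to}`);
  }
  return { ...run, state: to, updatedAt: now.toISOString() };
}

// What the cancel, pause, and resume endpoints could delegate to.
const cancelRun = (run: AgentRunDoc): AgentRunDoc => moveTo(run, "cancelled");
const pauseRun = (run: AgentRunDoc): AgentRunDoc => moveTo(run, "paused");

function resumeRun(run: AgentRunDoc): AgentRunDoc {
  if (run.state !== "paused" && run.state !== "waiting_for_input") {
    throw new Error(`Run ${run.id}: nothing to resume from ${run.state}`);
  }
  return moveTo(run, "running");
}
```

Because terminal states have no outgoing transitions, a late cancel request against a finished run fails loudly instead of silently rewriting history.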
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 - Queue-Backed Execution
|
||||
|
||||
- [ ] Add `agent.run.execute` queue type on top of `@bytelyst/queue`
|
||||
- [ ] Add per-step retries with backoff
|
||||
- [ ] Add lease heartbeat for long-running steps
|
||||
- [ ] Add idempotency keys for replays
|
||||
- [ ] Add dead-letter handling and operator inspection
|
||||
- [ ] Record structured run events for step started/completed/failed/retried
|
||||
|
||||
**Acceptance Criteria**
|
||||
|
||||
- In-flight runs survive worker restart
|
||||
- Retried steps do not duplicate side effects when idempotency is configured
|
||||
- Dead-lettered runs are queryable and replayable
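
The retry and idempotency items can be sketched without reference to the real `@bytelyst/queue` API, which is not shown in this document. Everything below is hypothetical: `IdempotencyStore` is an in-memory stand-in for a durable store, and the backoff defaults are placeholders.

```typescript
// Exponential backoff with a ceiling; the base and cap defaults are illustrative.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
}

// In-memory stand-in for a durable idempotency store (hypothetical; a real one
// would persist results alongside the run record).
class IdempotencyStore {
  private readonly results = new Map<string, unknown>();

  async runOnce<T>(key: string, effect: () => Promise<T>): Promise<T> {
    if (this.results.has(key)) {
      return this.results.get(key) as T; // replay: skip the side effect
    }
    const result = await effect();
    this.results.set(key, result); // only successful results are cached
    return result;
  }
}

type Sleep = (ms: number) => Promise<void>;
const realSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function executeStep<T>(
  store: IdempotencyStore,
  idempotencyKey: string,
  effect: () => Promise<T>,
  maxAttempts = 3,
  sleep: Sleep = realSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await store.runOnce(idempotencyKey, effect);
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError; // the caller would route this to dead-letter handling
}
```

Note that this sketch does not deduplicate concurrent calls with the same key; a durable implementation would need a lease or a conditional write for that.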
|
||||
|
||||
---
|
||||
|
||||
## Phase 3 - A2A Integration
|
||||
|
||||
- [ ] Replace direct A2A pipeline progression in `mcp-server` with run orchestration APIs
|
||||
- [ ] Make every pipeline step emit durable run events
|
||||
- [ ] Support handoff to human review queue
|
||||
- [ ] Support child runs for delegated agent tasks
|
||||
- [ ] Add run-level audit links to diagnostics, telemetry, and support systems
|
||||
|
||||
**Acceptance Criteria**
|
||||
|
||||
- `mcp-server` no longer owns the durable run state itself
|
||||
- A2A pipelines are observable step-by-step
|
||||
- Human review can pause and later resume a run
|
||||
|
||||
---
|
||||
|
||||
## Phase 4 - Operator Experience
|
||||
|
||||
- [ ] Admin UI for runs, filters, and replay
|
||||
- [ ] Timeline view per run
|
||||
- [ ] Step diff view for prompt/tool transitions
|
||||
- [ ] Cancel/retry/replay controls
|
||||
- [ ] SLOs: success rate, mean run duration, retry rate, dead-letter count
|
||||
|
||||
---
|
||||
|
||||
## Tech Stack Options
|
||||
|
||||
| Option                       | Pros                                              | Cons                              | Fit                     |
| ---------------------------- | ------------------------------------------------- | --------------------------------- | ----------------------- |
| Temporal                     | Best workflow semantics, retries, signals, timers | New infra, steeper learning curve | Best long-term          |
| BullMQ + Redis + run DB      | Simple, common in Node                            | Workflow semantics are custom     | Strong practical option |
| `@bytelyst/queue` + run docs | Lowest disruption to repo                         | More framework logic to build     | Best immediate path     |
|
||||
|
||||
---
|
||||
|
||||
## Risks
|
||||
|
||||
- Custom orchestration can become a weak in-house Temporal clone if not scoped tightly
|
||||
- If step contracts are not versioned, replay becomes unsafe
|
||||
- If all state remains in logs, operator tooling will never be reliable
|
||||
|
||||
---
|
||||
|
||||
## Recommendation
|
||||
|
||||
Implement the v1 run system inside `platform-service` using `@bytelyst/queue`, but
|
||||
borrow Temporal-style concepts from day one:
|
||||
|
||||
- workflow ID
|
||||
- run ID
|
||||
- signals
|
||||
- child runs
|
||||
- durable timers
|
||||
- explicit waiting states
|
||||
@ -0,0 +1,94 @@
|
||||
# AI Budget & Cost Governance Roadmap
|
||||
|
||||
> **Purpose:** Add per-tenant and per-agent controls for model spend, quotas,
|
||||
> budgets, alerts, and invoiceable AI usage.
|
||||
>
|
||||
> **Primary Surface:** `services/platform-service/`
|
||||
>
|
||||
> **Status:** Planned
|
||||
>
|
||||
> **Estimated Effort:** 2-3 weeks
|
||||
|
||||
---
|
||||
|
||||
## Why This Is Missing
|
||||
|
||||
Model usage is often more volatile than standard API usage. Agent companies need
|
||||
controls for:
|
||||
|
||||
- daily and monthly spend caps
|
||||
- per-workspace or per-agent budgets
|
||||
- model allowlists and deny rules
|
||||
- burst protection
|
||||
- usage attribution for billing
|
||||
|
||||
The repo has usage and billing modules, but not a dedicated AI cost governance layer.
|
||||
|
||||
---
|
||||
|
||||
## Recommended Stack
|
||||
|
||||
- `platform-service` cost governance module
|
||||
- usage ledger in Cosmos or PostgreSQL
|
||||
- provider-specific pricing tables
|
||||
- alerting through Slack, Telegram, and email
|
||||
|
||||
### Recommendation
|
||||
|
||||
This fits naturally in `platform-service`. The key is not the datastore; it is the
|
||||
quality of attribution and enforcement.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 - Usage Ledger
|
||||
|
||||
- [ ] Create modules:
  - [ ] `ai-usage`
  - [ ] `ai-budgets`
  - [ ] `ai-pricing`
- [ ] Store:
  - [ ] tenant
  - [ ] workspace
  - [ ] agent
  - [ ] provider
  - [ ] model
  - [ ] tokens or units
  - [ ] cost estimate
  - [ ] request correlation ID
|
||||
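
The ledger fields listed in Phase 1 could be captured roughly as follows. This is a sketch under stated assumptions: the field names, the split into input and output tokens, and the `example-provider` prices are placeholders, not real provider pricing or an existing schema.

```typescript
interface AiUsageEntry {
  tenantId: string;
  workspaceId: string;
  agentId: string;
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  costEstimateUsd: number;
  correlationId: string;
}

interface ModelPrice {
  inputPer1kUsd: number;
  outputPer1kUsd: number;
}

// Placeholder numbers for a made-up provider; not real pricing.
const PRICING = new Map<string, ModelPrice>([
  ["example-provider/small", { inputPer1kUsd: 0.5, outputPer1kUsd: 1.5 }],
  ["example-provider/large", { inputPer1kUsd: 2, outputPer1kUsd: 6 }],
]);

function estimateCostUsd(
  provider: string,
  model: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const price = PRICING.get(`${provider}/${model}`);
  if (price === undefined) {
    throw new Error(`No pricing entry for ${provider}/${model}`);
  }
  return (inputTokens / 1000) * price.inputPer1kUsd + (outputTokens / 1000) * price.outputPer1kUsd;
}

function recordUsage(
  ledger: AiUsageEntry[],
  usage: Omit<AiUsageEntry, "costEstimateUsd">,
): AiUsageEntry {
  const entry: AiUsageEntry = {
    ...usage,
    costEstimateUsd: estimateCostUsd(usage.provider, usage.model, usage.inputTokens, usage.outputTokens),
  };
  ledger.push(entry); // append-only: entries are never edited in place
  return entry;
}

function spendByTenant(ledger: AiUsageEntry[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const entry of ledger) {
    totals.set(entry.tenantId, (totals.get(entry.tenantId) ?? 0) + entry.costEstimateUsd);
  }
  return totals;
}
```

Failing on an unknown model, rather than recording a zero cost, keeps pricing gaps visible instead of letting them understate spend.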
|
||||
---
|
||||
|
||||
## Phase 2 - Enforcement
|
||||
|
||||
- [ ] Preflight budget checks before expensive calls
- [ ] Rate and spend throttles
- [ ] Model allowlists by tenant
- [ ] Degrade-to-cheaper-model policy
- [ ] Hard cap vs. soft cap behavior
|
||||
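
A preflight check that distinguishes soft and hard caps might look like the sketch below. The `Budget` shape, the decision names, and the rule that a soft cap triggers the cheaper-model fallback are all assumptions chosen for illustration.

```typescript
type BudgetDecision =
  | { action: "allow" }
  | { action: "degrade"; model: string }
  | { action: "deny"; reason: string };

interface Budget {
  softCapUsd: number;
  hardCapUsd: number;
  fallbackModel?: string; // cheaper model to use once the soft cap is crossed
}

function preflight(budget: Budget, spentUsd: number, estimatedCostUsd: number): BudgetDecision {
  const projected = spentUsd + estimatedCostUsd;
  if (projected > budget.hardCapUsd) {
    return { action: "deny", reason: `projected spend ${projected} exceeds hard cap ${budget.hardCapUsd}` };
  }
  if (projected > budget.softCapUsd && budget.fallbackModel !== undefined) {
    return { action: "degrade", model: budget.fallbackModel };
  }
  // Under the soft cap, or over it with no fallback configured: let the call
  // through. An alert hook for the soft-cap case would go here.
  return { action: "allow" };
}
```

The check runs on projected spend (already spent plus the estimate for this call), so a single expensive request cannot overshoot the hard cap.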
|
||||
---
|
||||
|
||||
## Phase 3 - Visibility
|
||||
|
||||
- [ ] Admin reports by tenant, agent, provider, and model
- [ ] Budget burn-down alerts
- [ ] Anomaly detection for spend spikes
- [ ] Export for finance and customer invoicing
|
||||
|
||||
---
|
||||
|
||||
## Tech Stack Options
|
||||
|
||||
| Option                                  | Pros                              | Cons                          | Fit                       |
| --------------------------------------- | --------------------------------- | ----------------------------- | ------------------------- |
| Platform-native ledger + pricing tables | Full control and tenant awareness | Requires pricing upkeep       | Best fit                  |
| External spend tool only                | Fast bootstrap                    | Weak product attribution      | Limited                   |
| Billing-module extension only           | Less module sprawl                | AI-specific logic gets buried | Acceptable but less clear |
|
||||
|
||||
---
|
||||
|
||||
## Risks
|
||||
|
||||
- Without spend controls, one bad prompt or loop can create material cost
|
||||
- Without tenant attribution, enterprise billing becomes unreliable
|
||||
- Without enforcement, dashboards become retrospective only
|
||||
@ -0,0 +1,92 @@
|
||||
# AI Governance & Evaluation Roadmap
|
||||
|
||||
> **Purpose:** Centralize evals, policy enforcement, safety review, release gates,
|
||||
> and regression tracking for prompts, agents, and model behavior.
|
||||
>
|
||||
> **Primary Surfaces:** `services/platform-service/`, `services/extraction-service/`,
|
||||
> `services/mcp-server/`
|
||||
>
|
||||
> **Status:** Planned
|
||||
>
|
||||
> **Estimated Effort:** 3-5 weeks
|
||||
|
||||
---
|
||||
|
||||
## Why This Is Missing
|
||||
|
||||
The repo has useful pieces:
|
||||
|
||||
- extraction evals
|
||||
- telemetry and diagnostics
|
||||
- flags and experiments
|
||||
|
||||
What it does not have is a central AI governance surface that answers:
|
||||
|
||||
- which prompts are approved
|
||||
- which eval suite a release passed
|
||||
- what policies apply to a class of agents
|
||||
- what changed after a model or prompt update
|
||||
|
||||
---
|
||||
|
||||
## Recommended Stack
|
||||
|
||||
- `platform-service` governance modules
|
||||
- OpenTelemetry for trace-linked evidence
|
||||
- Promptfoo or a similar eval harness for offline regression
|
||||
- policy layer that starts with code-first rules, with Cedar or OPA as an optional later step
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 - Eval Registry
|
||||
|
||||
- [ ] Create modules:
  - [ ] `ai-evals`
  - [ ] `ai-policies`
  - [ ] `ai-releases`
- [ ] Add entities:
  - [ ] benchmark set
  - [ ] eval run
  - [ ] eval result
  - [ ] policy decision
  - [ ] release gate
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 - Policy Engine
|
||||
|
||||
- [ ] Add policy checks for:
  - [ ] allowed models
  - [ ] max temperature
  - [ ] blocked tools
  - [ ] required human review
  - [ ] tenant-specific restrictions
|
||||
- [ ] Add release gates based on eval thresholds
|
||||
- [ ] Add regression detection on prompt or model changes
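
The policy checks above lend themselves to a code-first evaluator. The following sketch is hypothetical: the `AgentPolicy` and `CallRequest` shapes are invented for illustration, and the merge rule for tenant restrictions (the stricter value wins) is an assumption rather than a stated requirement.

```typescript
interface AgentPolicy {
  allowedModels: string[];
  maxTemperature: number;
  blockedTools: string[];
  requireHumanReview: boolean;
}

interface CallRequest {
  model: string;
  temperature: number;
  tools: string[];
}

interface PolicyDecision {
  allowed: boolean;
  requiresReview: boolean;
  violations: string[]; // kept so every decision can be audit logged
}

function evaluatePolicy(policy: AgentPolicy, request: CallRequest): PolicyDecision {
  const violations: string[] = [];
  if (!policy.allowedModels.includes(request.model)) {
    violations.push(`model not allowed: ${request.model}`);
  }
  if (request.temperature > policy.maxTemperature) {
    violations.push(`temperature ${request.temperature} exceeds ${policy.maxTemperature}`);
  }
  for (const tool of request.tools) {
    if (policy.blockedTools.includes(tool)) violations.push(`tool blocked: ${tool}`);
  }
  return { allowed: violations.length === 0, requiresReview: policy.requireHumanReview, violations };
}

// Tenant restrictions can only tighten the base policy (assumed merge rule).
function mergeTenantPolicy(base: AgentPolicy, tenant: Partial<AgentPolicy>): AgentPolicy {
  const tenantModels = tenant.allowedModels;
  return {
    allowedModels:
      tenantModels === undefined
        ? base.allowedModels
        : base.allowedModels.filter((model) => tenantModels.includes(model)),
    maxTemperature: Math.min(base.maxTemperature, tenant.maxTemperature ?? base.maxTemperature),
    blockedTools: Array.from(new Set([...base.blockedTools, ...(tenant.blockedTools ?? [])])),
    requireHumanReview: base.requireHumanReview || (tenant.requireHumanReview ?? false),
  };
}
```

Returning the full list of violations, not just a boolean, gives the policy-decision entity something concrete to store.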
|
||||
|
||||
---
|
||||
|
||||
## Phase 3 - Operational Governance
|
||||
|
||||
- [ ] Link agent and prompt versions to eval runs
|
||||
- [ ] Add incident-driven rollback recommendations
|
||||
- [ ] Add policy override audit logs
|
||||
- [ ] Add dashboards for pass rate, drift, and blocked releases
|
||||
|
||||
---
|
||||
|
||||
## Tech Stack Options
|
||||
|
||||
| Option                        | Pros                       | Cons                     | Fit                 |
| ----------------------------- | -------------------------- | ------------------------ | ------------------- |
| Promptfoo + platform registry | Good current ecosystem fit | Need custom service glue | Best near-term      |
| Custom eval runner only       | Full control               | Reinvents too much       | Weak starting point |
| OPA/Cedar-backed governance   | Strong policy model        | More complexity          | Good phase 2+       |
|
||||
|
||||
---
|
||||
|
||||
## Risks
|
||||
|
||||
- Shipping prompts without eval gating causes avoidable regressions
|
||||
- Governance only in docs will drift from runtime
|
||||
- No policy audit trail creates enterprise trust problems
|
||||
@ -0,0 +1,94 @@
|
||||
# Durable Event Bus & Worker Runtime Roadmap
|
||||
|
||||
> **Purpose:** Replace in-process eventing and scattered background execution
|
||||
> with a durable cross-service event and worker backbone.
|
||||
>
|
||||
> **Primary Surfaces:** `packages/events/`, `packages/queue/`, `services/platform-service/`
|
||||
>
|
||||
> **Status:** Planned
|
||||
>
|
||||
> **Estimated Effort:** 3-4 weeks
|
||||
|
||||
---
|
||||
|
||||
## Why This Is Missing
|
||||
|
||||
`packages/events/src/memory.ts` is an in-process event bus. That is useful for local
|
||||
dispatch inside one process, but it is not enough for:
|
||||
|
||||
- cross-service subscriptions
|
||||
- replay
|
||||
- dead-letter handling
|
||||
- durable delivery
|
||||
- delayed fan-out
|
||||
|
||||
The new `@bytelyst/queue` package improves durable background work, but the eventing
|
||||
layer is still incomplete.
|
||||
|
||||
---
|
||||
|
||||
## Recommended Stack
|
||||
|
||||
### Best Long-Term Industry Standard
|
||||
|
||||
- **Redis Streams** or **NATS JetStream** for durable event delivery
|
||||
- `@bytelyst/queue` or BullMQ for work execution
|
||||
- OpenTelemetry for trace correlation
|
||||
|
||||
### Repo-Fit Option
|
||||
|
||||
- Add a durable adapter to `@bytelyst/events`
|
||||
- Use Redis-backed delivery first
|
||||
- Keep current memory bus as test/dev adapter
|
||||
|
||||
### Recommendation
|
||||
|
||||
Use `@bytelyst/events` as the interface, but add a durable Redis or NATS adapter.
|
||||
Do not let direct in-memory emitters remain the production default for critical flows.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 - Event Abstraction
|
||||
|
||||
- [ ] Extend `@bytelyst/events` to support pluggable backends
|
||||
- [ ] Keep `memory` for tests
|
||||
- [ ] Add `redis-streams` or `jetstream` adapter
|
||||
- [ ] Add consumer groups, ack, retry, and dead-letter support
|
||||
- [ ] Add correlation and causation IDs
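
To illustrate what a pluggable backend contract could look like, here is a sketch with a memory adapter that already models ack, retry, and dead-lettering. The interface and class names are hypothetical and do not reflect the current `@bytelyst/events` API; a Redis Streams or JetStream adapter would implement the same `EventBackend` interface.

```typescript
interface EventEnvelope<T = unknown> {
  id: string;
  type: string;
  payload: T;
  correlationId: string; // shared by every event in one business flow
  causationId?: string; // id of the event that directly triggered this one
}

type EventHandler = (event: EventEnvelope) => Promise<void>;

// The adapter contract each backend would implement.
interface EventBackend {
  publish(event: EventEnvelope): Promise<void>;
  subscribe(type: string, handler: EventHandler): void;
}

class MemoryBackend implements EventBackend {
  readonly deadLetters: EventEnvelope[] = [];
  private readonly handlers = new Map<string, EventHandler[]>();
  private readonly maxAttempts: number;

  constructor(maxAttempts = 3) {
    this.maxAttempts = maxAttempts;
  }

  subscribe(type: string, handler: EventHandler): void {
    this.handlers.set(type, [...(this.handlers.get(type) ?? []), handler]);
  }

  async publish(event: EventEnvelope): Promise<void> {
    for (const handler of this.handlers.get(event.type) ?? []) {
      let delivered = false;
      for (let attempt = 1; attempt <= this.maxAttempts && !delivered; attempt++) {
        try {
          await handler(event);
          delivered = true; // handler returned, treat as acknowledged
        } catch {
          // leave delivered false so the loop retries
        }
      }
      if (!delivered) this.deadLetters.push(event);
    }
  }
}

// Derive a follow-up event that keeps the flow's correlation ID and records
// the parent as its cause.
function causedBy<T>(parent: EventEnvelope, id: string, type: string, payload: T): EventEnvelope<T> {
  return { id, type, payload, correlationId: parent.correlationId, causationId: parent.id };
}
```

Giving the test adapter the same retry and dead-letter behavior as production backends keeps handler tests honest about failure paths.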
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 - Worker Runtime
|
||||
|
||||
- [ ] Standardize worker bootstrap pattern
|
||||
- [ ] Add handler registration, concurrency controls, leases, and health endpoints
|
||||
- [ ] Add poison-message and dead-letter inspection
|
||||
- [ ] Add scheduling and delayed dispatch
|
||||
|
||||
---
|
||||
|
||||
## Phase 3 - Service Migration
|
||||
|
||||
- [ ] Move delivery subscribers onto durable events
|
||||
- [ ] Move auth side effects off fire-and-forget local emitters
|
||||
- [ ] Move MCP/A2A transitions onto durable events where appropriate
|
||||
- [ ] Add observability for event lag and failure rate
|
||||
|
||||
---
|
||||
|
||||
## Tech Stack Options
|
||||
|
||||
| Option          | Pros                                            | Cons                           | Fit                 |
| --------------- | ----------------------------------------------- | ------------------------------ | ------------------- |
| NATS JetStream  | Strong event semantics, lightweight             | New infra and integration work | Excellent long-term |
| Redis Streams   | Familiar, easy to adopt with BullMQ-style stack | Less specialized than NATS     | Best pragmatic path |
| Kafka           | Powerful at scale                               | Heavy operational footprint    | Overkill now        |
| Memory bus only | Simple                                          | Not durable                    | Dev/test only       |
|
||||
|
||||
---
|
||||
|
||||
## Risks
|
||||
|
||||
- In-process events hide failures and block cross-service reliability
|
||||
- Durable queues without durable events still leave side effects fragile
|
||||
- Multiple custom worker patterns will drift without a standard runtime
|
||||
@ -0,0 +1,81 @@
|
||||
# Enterprise Provisioning & SCIM Roadmap
|
||||
|
||||
> **Purpose:** Extend enterprise identity from federation-only to full lifecycle
|
||||
> provisioning, deprovisioning, group sync, and seat governance.
|
||||
>
|
||||
> **Primary Surface:** `services/platform-service/src/modules/auth/enterprise/`
|
||||
>
|
||||
> **Status:** Planned
|
||||
>
|
||||
> **Estimated Effort:** 2-3 weeks
|
||||
|
||||
---
|
||||
|
||||
## Why This Is Missing
|
||||
|
||||
The platform already has enterprise SAML and OIDC federation. That solves login.
|
||||
It does not solve enterprise lifecycle management:
|
||||
|
||||
- just-in-time user provisioning policies
|
||||
- SCIM user sync
|
||||
- group sync
|
||||
- deprovisioning
|
||||
- seat and entitlement mapping
|
||||
|
||||
---
|
||||
|
||||
## Recommended Stack
|
||||
|
||||
- Extend `platform-service` enterprise auth
|
||||
- SCIM 2.0 endpoints in Fastify
|
||||
- org/workspace mapping from the tenant model
|
||||
- optional background sync jobs using `@bytelyst/queue`
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 - SCIM Foundations
|
||||
|
||||
- [ ] Add SCIM service provider config endpoint
|
||||
- [ ] Add SCIM resource schemas
|
||||
- [ ] Add endpoints for:
  - [ ] `/scim/v2/Users`
  - [ ] `/scim/v2/Groups`
  - [ ] PATCH
  - [ ] deactivate
|
||||
- [ ] Add enterprise API tokens and audit logs
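
Deactivation in SCIM 2.0 is typically sent as a PATCH that replaces the `active` attribute. The sketch below handles only that case plus a `userName` replace; the `ScimUser` shape is deliberately reduced and `applyUserPatch` is a hypothetical helper. A production endpoint would need the full path and filter handling from the specification.

```typescript
// Message schema URN defined by the SCIM 2.0 protocol (RFC 7644).
const PATCH_OP_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp";

interface ScimUser {
  id: string;
  userName: string;
  active: boolean;
}

interface ScimPatchOperation {
  op: string;
  path?: string;
  value?: unknown;
}

interface ScimPatchRequest {
  schemas: string[];
  Operations: ScimPatchOperation[];
}

function applyUserPatch(user: ScimUser, request: ScimPatchRequest): ScimUser {
  if (!request.schemas.includes(PATCH_OP_SCHEMA)) {
    throw new Error("Unsupported SCIM message schema");
  }
  let next: ScimUser = { ...user };
  for (const operation of request.Operations) {
    // Some identity providers capitalize the op name, so compare case-insensitively.
    const op = operation.op.toLowerCase();
    if (op === "replace" && operation.path === "active" && typeof operation.value === "boolean") {
      next = { ...next, active: operation.value };
    } else if (op === "replace" && operation.path === "userName" && typeof operation.value === "string") {
      next = { ...next, userName: operation.value };
    } else {
      throw new Error(`Unsupported operation: ${operation.op} ${operation.path ?? ""}`);
    }
  }
  return next;
}
```

Rejecting unsupported operations explicitly is safer than ignoring them, because a silently dropped deactivation is exactly the deprovision lag the roadmap warns about.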
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 - Provisioning Rules
|
||||
|
||||
- [ ] Map SCIM users to org/workspace memberships
|
||||
- [ ] Map groups to roles or teams
|
||||
- [ ] Support seat assignment and revocation
|
||||
- [ ] Add deprovision grace policy
|
||||
|
||||
---
|
||||
|
||||
## Phase 3 - Admin Controls
|
||||
|
||||
- [ ] Admin UI for provisioning state and sync errors
- [ ] Reconciliation jobs
- [ ] Audit exports
- [ ] Break-glass override flows
|
||||
|
||||
---
|
||||
|
||||
## Tech Stack Options
|
||||
|
||||
| Option                          | Pros                                | Cons                                 | Fit                               |
| ------------------------------- | ----------------------------------- | ------------------------------------ | --------------------------------- |
| Native SCIM in platform-service | Full control, strong enterprise fit | Must implement spec carefully        | Best long-term                    |
| IdP proxy product               | Faster setup                        | External dependency and less control | Acceptable only if needed quickly |
| JIT only                        | Minimal effort                      | Not enough for enterprise IT         | Inadequate                        |
|
||||
|
||||
---
|
||||
|
||||
## Risks
|
||||
|
||||
- Enterprise login without enterprise provisioning still creates admin pain
|
||||
- Group mapping drift leads to incorrect access
|
||||
- Deprovision lag is a real security risk
|
||||
@ -0,0 +1,93 @@
|
||||
# Human Review & Approval Queue Roadmap
|
||||
|
||||
> **Purpose:** Add a generic human-in-the-loop system for agent actions,
|
||||
> escalations, approvals, and quality review.
|
||||
>
|
||||
> **Primary Surface:** `services/platform-service/`
|
||||
>
|
||||
> **Status:** Planned
|
||||
>
|
||||
> **Estimated Effort:** 2-3 weeks
|
||||
|
||||
---
|
||||
|
||||
## Why This Is Missing
|
||||
|
||||
The platform has MFA push approvals, but that is a narrow auth flow. An agent company
|
||||
also needs a generic review queue for cases like:
|
||||
|
||||
- send this message
|
||||
- execute this external action
|
||||
- publish this recommendation
|
||||
- approve this prompt change
|
||||
- inspect low-confidence output
|
||||
|
||||
---
|
||||
|
||||
## Recommended Stack
|
||||
|
||||
- `platform-service` review module
|
||||
- `@bytelyst/queue` for routing and escalation timers
|
||||
- Slack and Telegram delivery adapters for reviewer notifications
|
||||
- Optional policy engine later with OpenFGA or Cedar
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 - Review Objects
|
||||
|
||||
- [ ] Create modules:
  - [ ] `reviews`
  - [ ] `approvals`
  - [ ] `escalations`
- [ ] Define review object fields:
  - [ ] subject type
  - [ ] subject ID
  - [ ] review reason
  - [ ] risk level
  - [ ] required decision type
  - [ ] assigned reviewer(s)
  - [ ] SLA and due time
  - [ ] supporting evidence
- [ ] Add states:
  - [ ] pending
  - [ ] claimed
  - [ ] approved
  - [ ] rejected
  - [ ] expired
  - [ ] superseded
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 - Workflow Integration
|
||||
|
||||
- [ ] Allow agent runs to emit `waiting_for_review`
|
||||
- [ ] Add review decision callbacks to resume or cancel runs
|
||||
- [ ] Add escalation timers and reassignment
|
||||
- [ ] Add reviewer comments and audit trail
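
The review states and the decision callback can be sketched together. The `Review` fields below are a reduced subset of the Phase 1 list, and the mapping from review outcome to run action (including cancelling on expiry) is an assumed policy, not a stated requirement.

```typescript
type ReviewState = "pending" | "claimed" | "approved" | "rejected" | "expired" | "superseded";
type RunAction = "resume" | "cancel" | "keep_waiting";

interface Review {
  id: string;
  subjectType: string;
  subjectId: string; // e.g. the agent run waiting on this review
  riskLevel: "low" | "medium" | "high";
  state: ReviewState;
  assignedReviewer?: string;
  dueAt: string; // ISO-8601 SLA deadline
  auditTrail: string[];
}

function claim(review: Review, reviewer: string): Review {
  if (review.state !== "pending") {
    throw new Error(`Review ${review.id} is ${review.state}, not pending`);
  }
  return {
    ...review,
    state: "claimed",
    assignedReviewer: reviewer,
    auditTrail: [...review.auditTrail, `${reviewer} claimed`],
  };
}

function decide(
  review: Review,
  reviewer: string,
  decision: "approved" | "rejected",
  comment: string,
): Review {
  if (review.state !== "claimed" || review.assignedReviewer !== reviewer) {
    throw new Error(`Review ${review.id} is not claimed by ${reviewer}`);
  }
  return {
    ...review,
    state: decision,
    auditTrail: [...review.auditTrail, `${reviewer} ${decision}: ${comment}`],
  };
}

function expireIfOverdue(review: Review, now: Date): Review {
  const open = review.state === "pending" || review.state === "claimed";
  if (!open || new Date(review.dueAt).getTime() > now.getTime()) return review;
  return { ...review, state: "expired", auditTrail: [...review.auditTrail, "expired after SLA"] };
}

// Assumed mapping from review outcome to the waiting run. A superseded review
// keeps the run waiting because a newer review is expected to decide it.
function runActionFor(review: Review): RunAction {
  switch (review.state) {
    case "approved":
      return "resume";
    case "rejected":
    case "expired":
      return "cancel";
    default:
      return "keep_waiting";
  }
}
```

Every state change appends to `auditTrail`, which is the minimum needed to answer who decided what and when.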
|
||||
|
||||
---
|
||||
|
||||
## Phase 3 - Reviewer Experience
|
||||
|
||||
- [ ] API and admin UI queue
- [ ] Bulk claim and assignment
- [ ] Notification fan-out via Slack/Telegram/email
- [ ] Filters by risk, workspace, agent, age, reviewer
|
||||
|
||||
---
|
||||
|
||||
## Tech Stack Options
|
||||
|
||||
| Option                                  | Pros                       | Cons                                           | Fit                 |
| --------------------------------------- | -------------------------- | ---------------------------------------------- | ------------------- |
| Platform module + queue + notifications | Simple and aligned to repo | More UI to build                               | Best immediate path |
| Commercial ticketing/workflow tool      | Fast start                 | External dependency and poor control-plane fit | Poor long-term      |
| Dedicated BPM engine                    | Powerful                   | Too heavy for initial need                     | Overkill initially  |
|
||||
|
||||
---
|
||||
|
||||
## Risks
|
||||
|
||||
- If approvals are only implemented ad hoc per module, policy becomes inconsistent
|
||||
- If decisions are not audit logged, enterprise trust will be weak
|
||||
- Review queues without SLAs and clear ownership become dead-letter inboxes
|
||||
@ -0,0 +1,117 @@
|
||||
# Knowledge & RAG Service Roadmap
|
||||
|
||||
> **Purpose:** Build a shared knowledge platform for ingestion, chunking,
|
||||
> embeddings, retrieval, citations, and access-controlled context assembly.
|
||||
>
|
||||
> **Primary Surfaces:** `services/platform-service/`, `services/extraction-service/`
|
||||
>
|
||||
> **Status:** Planned
|
||||
>
|
||||
> **Estimated Effort:** 4-6 weeks
|
||||
|
||||
---
|
||||
|
||||
## Why This Is Missing
|
||||
|
||||
The repo already has extraction and some vector-based diagnostics work, but there is
|
||||
no reusable platform service for general retrieval-augmented generation across
|
||||
products and agents.
|
||||
|
||||
Every serious agent company eventually needs:
|
||||
|
||||
- managed document ingestion
|
||||
- chunking and metadata
|
||||
- embeddings
|
||||
- retrieval APIs
|
||||
- citations and provenance
|
||||
- workspace-aware access control
|
||||
|
||||
---
|
||||
|
||||
## Recommended Stack
|
||||
|
||||
### Best Long-Term Industry Standard
|
||||
|
||||
- **PostgreSQL + pgvector** for integrated metadata + vector search
|
||||
- **Qdrant** if vector-first performance becomes dominant
|
||||
- **Blob storage** for source documents
|
||||
|
||||
### Cloud-Native Alternative
|
||||
|
||||
- **Azure AI Search** for retrieval
|
||||
- Cosmos or Postgres for metadata
|
||||
|
||||
### Recommendation
|
||||
|
||||
Use PostgreSQL + pgvector if you want the strongest balance of flexibility,
|
||||
ownership, and industry-standard retrieval patterns. Azure AI Search is a valid
|
||||
alternative if deep Azure integration matters more than datastore simplicity.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 - Knowledge Objects
|
||||
|
||||
- [ ] Create modules:
  - [ ] `knowledge-sources`
  - [ ] `knowledge-documents`
  - [ ] `knowledge-chunks`
  - [ ] `knowledge-indexes`
- [ ] Add ingestion states:
  - [ ] uploaded
  - [ ] parsed
  - [ ] chunked
  - [ ] embedded
  - [ ] indexed
  - [ ] failed
- [ ] Add source provenance:
  - [ ] filename
  - [ ] URL
  - [ ] connector type
  - [ ] page or section references
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 - Retrieval Pipeline
|
||||
|
||||
- [ ] Add chunking service with configurable strategies
|
||||
- [ ] Add embedding generation pipeline
|
||||
- [ ] Add hybrid search:
  - [ ] lexical
  - [ ] vector
  - [ ] metadata filters
|
||||
- [ ] Add citation builder and quote bounds
|
||||
- [ ] Add workspace and org scoping
|
||||
|
||||
**Acceptance Criteria**
|
||||
|
||||
- Retrieval returns chunks with citations and permission-safe metadata
|
||||
- Different products can share the same retrieval API
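
One common way to combine lexical and vector results is reciprocal rank fusion, shown here as an illustrative choice; the roadmap does not prescribe a fusion method. The sketch assumes the two rankings arrive as ordered lists of chunk IDs, and the `Chunk` shape, the citation format, and the constant `k = 60` are assumptions.

```typescript
interface Chunk {
  id: string;
  workspaceId: string;
  text: string;
  source: { filename: string; page?: number };
}

interface RetrievedChunk {
  chunk: Chunk;
  score: number;
  citation: string;
}

// Reciprocal rank fusion: each ranking contributes 1 / (k + rank), rank starting at 1.
function fuseRankings(lexical: string[], vector: string[], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of [lexical, vector]) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return scores;
}

function retrieve(
  chunks: Chunk[],
  lexical: string[],
  vector: string[],
  workspaceId: string,
  limit = 5,
): RetrievedChunk[] {
  const scores = fuseRankings(lexical, vector);
  return chunks
    // Scope before scoring output: chunks from other workspaces never leave this function.
    .filter((chunk) => chunk.workspaceId === workspaceId && scores.has(chunk.id))
    .map((chunk) => ({
      chunk,
      score: scores.get(chunk.id) ?? 0,
      citation:
        chunk.source.page !== undefined
          ? `${chunk.source.filename}, p. ${chunk.source.page}`
          : chunk.source.filename,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

In a real service the workspace filter would be pushed down into the lexical and vector queries themselves, so that out-of-scope chunks are never ranked at all.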
|
||||
|
||||
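
To make the hybrid search and scoping items concrete, here is a small in-memory sketch. It assumes reciprocal rank fusion (RRF) as the way to combine lexical and vector rankings, which is one common choice and not something this roadmap prescribes. All type and function names are hypothetical, and a real service would delegate ranking to the datastore. The point it illustrates is ordering: org and workspace filters are applied before any ranking, so out-of-scope chunks never enter the result set.

```typescript
// Hypothetical shapes for illustration only.
interface Chunk {
  id: string;
  orgId: string;
  workspaceId: string;
  text: string;
  embedding: number[];
  // Quote bounds: character offsets of the chunk inside its source document.
  citation: { documentId: string; startOffset: number; endOffset: number };
}

interface RetrievalQuery {
  orgId: string;
  workspaceId: string;
  terms: string[];
  embedding: number[];
  limit: number;
}

// Naive lexical score: number of query terms that occur in the chunk text.
function lexicalScore(chunk: Chunk, terms: string[]): number {
  const text = chunk.text.toLowerCase();
  return terms.filter((t) => text.includes(t.toLowerCase())).length;
}

// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Chunk ids ordered from best to worst under the given scoring function.
function rankIds(chunks: Chunk[], score: (c: Chunk) => number): string[] {
  return [...chunks].sort((x, y) => score(y) - score(x)).map((c) => c.id);
}

function hybridSearch(chunks: Chunk[], q: RetrievalQuery): Chunk[] {
  // Scope first: only chunks in the caller's org and workspace are ranked.
  const visible = chunks.filter(
    (c) => c.orgId === q.orgId && c.workspaceId === q.workspaceId,
  );
  const rankings = [
    rankIds(visible, (c) => lexicalScore(c, q.terms)),
    rankIds(visible, (c) => cosine(c.embedding, q.embedding)),
  ];
  const K = 60; // conventional RRF smoothing constant (assumed here)
  const fused = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      fused.set(id, (fused.get(id) ?? 0) + 1 / (K + index + 1));
    });
  }
  return [...visible]
    .sort((x, y) => (fused.get(y.id) ?? 0) - (fused.get(x.id) ?? 0))
    .slice(0, q.limit);
}
```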
---

## Phase 3 - Connectors

- [ ] File upload
- [ ] Web page ingestion
- [ ] Notes/workspace connector
- [ ] Blob-backed ingestion
- [ ] Optional Slack/Confluence/Google Drive connectors

---

## Tech Stack Options

| Option                   | Pros                                       | Cons                         | Fit                              |
| ------------------------ | ------------------------------------------ | ---------------------------- | -------------------------------- |
| Postgres + pgvector      | Strong standard, unified metadata + vector | Requires new datastore       | Best overall                     |
| Qdrant + metadata DB     | Great vector performance                   | Two systems to operate       | Good at scale                    |
| Azure AI Search          | Strong managed search                      | Tighter vendor coupling      | Best Azure-managed option        |
| Cosmos vector workaround | Least disruption                           | Not ideal as main RAG engine | Avoid as primary long-term stack |

---

## Risks

- Retrieval without access control causes data leakage between tenants
- Retrieval without citations causes trust and compliance issues
- Embeddings without source lifecycle management become stale quickly

122
docs/roadmaps/not-started/platform_ORG_WORKSPACE_RBAC_ROADMAP.md
Normal file
@ -0,0 +1,122 @@
# Org, Workspace & RBAC Roadmap

> **Purpose:** Add a first-class tenant model for organizations, workspaces,
> teams, memberships, scoped roles, and admin governance.
>
> **Primary Surface:** `services/platform-service/`
>
> **Status:** Planned
>
> **Estimated Effort:** 3-4 weeks

---

## Why This Is Missing

The platform has users and per-product memberships, but no canonical model for:

- organizations
- workspaces
- teams
- workspace-scoped roles
- resource ownership and sharing

Enterprise IdP support exists, but it does not replace a real tenant model.

---

## Recommended Stack

### Best Long-Term Industry Standard

- **PostgreSQL**
- **Drizzle ORM** or **Prisma**
- **OpenFGA** or Zanzibar-style authorization model for fine-grained access

### Best Repo-Fit Option

- `platform-service` module set backed by Cosmos
- Role and membership evaluation in service code
- Optional policy layer later using OpenFGA

### Recommendation

If tenanting will be central to the business, PostgreSQL is the better long-term
fit because org/workspace membership is relational by nature. If short-term
consistency matters more, start in Cosmos but keep the permission model portable.

---

## Phase 1 - Data Model

- [ ] Create modules:
  - [ ] `orgs`
  - [ ] `workspaces`
  - [ ] `teams`
  - [ ] `memberships`
  - [ ] `roles`
- [ ] Define resources:
  - [ ] organization
  - [ ] workspace
  - [ ] team
  - [ ] service account
  - [ ] API key
- [ ] Define roles:
  - [ ] `org_owner`
  - [ ] `org_admin`
  - [ ] `workspace_admin`
  - [ ] `workspace_editor`
  - [ ] `workspace_viewer`
  - [ ] `support_operator`
- [ ] Add invitation and deprovision flows

**Acceptance Criteria**

- Every protected resource can be tied to org/workspace ownership
- Users can belong to multiple workspaces with different roles

---

## Phase 2 - Authorization

- [ ] Add authorization helpers to `@bytelyst/auth` or a new `@bytelyst/authorization`
- [ ] Evaluate permissions by resource and action
- [ ] Add policy checks for:
  - [ ] read
  - [ ] write
  - [ ] execute
  - [ ] approve
  - [ ] administer
- [ ] Add service account and API key scopes

**Acceptance Criteria**

- Endpoints no longer rely only on flat `admin` vs `user`
- Policies are testable and reusable across modules

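
A resource-and-action check of the kind described above might look like the sketch below. The role and action names come from this roadmap; everything else is assumed for illustration, in particular the role-to-action mapping, the `Membership` shape, and the `can` helper. None of this is an existing `@bytelyst/auth` API.

```typescript
type Role =
  | "org_owner"
  | "org_admin"
  | "workspace_admin"
  | "workspace_editor"
  | "workspace_viewer"
  | "support_operator";

type Action = "read" | "write" | "execute" | "approve" | "administer";

// Assumed mapping: the roadmap names roles and actions, not which role grants what.
const ROLE_ACTIONS: Record<Role, readonly Action[]> = {
  org_owner: ["read", "write", "execute", "approve", "administer"],
  org_admin: ["read", "write", "execute", "approve", "administer"],
  workspace_admin: ["read", "write", "execute", "approve", "administer"],
  workspace_editor: ["read", "write", "execute"],
  workspace_viewer: ["read"],
  support_operator: ["read"],
};

// A membership without workspaceId is treated as org-wide (an assumption).
interface Membership {
  userId: string;
  orgId: string;
  workspaceId?: string;
  role: Role;
}

interface ResourceRef {
  orgId: string;
  workspaceId: string;
}

// Pure function over plain data, so policies stay unit-testable and reusable.
function can(
  memberships: Membership[],
  userId: string,
  action: Action,
  resource: ResourceRef,
): boolean {
  return memberships.some(
    (m) =>
      m.userId === userId &&
      m.orgId === resource.orgId &&
      (m.workspaceId === undefined || m.workspaceId === resource.workspaceId) &&
      ROLE_ACTIONS[m.role].includes(action),
  );
}
```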
---

## Phase 3 - Product Integration

- [ ] Migrate existing modules that should be workspace-scoped
- [ ] Add workspace headers or explicit route scoping
- [ ] Connect enterprise IdP claims to org/workspace resolution
- [ ] Add audit entries for membership and role changes

---

## Tech Stack Options

| Option                      | Pros                                | Cons                             | Fit                    |
| --------------------------- | ----------------------------------- | -------------------------------- | ---------------------- |
| PostgreSQL + OpenFGA        | Best long-term for RBAC and sharing | New datastore + auth layer       | Best industry-standard |
| PostgreSQL only             | Simpler than OpenFGA, still strong  | Fine-grained auth gets custom    | Good medium path       |
| Cosmos + service-level RBAC | Lowest disruption                   | Harder joins and policy richness | Good short-term        |

---

## Risks

- Flat roles will become a blocker for enterprise and multi-agent collaboration
- Delaying workspace boundaries causes later data migrations
- Fine-grained sharing is hard to retrofit once data models hardcode user ownership

@ -0,0 +1,87 @@
# Support Case Management Roadmap

> **Purpose:** Build a platform-native case system for customer issues, agent
> escalations, internal triage, and resolution tracking.
>
> **Primary Surfaces:** `services/platform-service/`, `services/mcp-server/`
>
> **Status:** Planned
>
> **Estimated Effort:** 3-4 weeks

---

## Why This Is Missing

The repo has diagnostics, telemetry, debug tooling, and support-oriented MCP helpers.
What it lacks is a canonical case record that ties them together.

Without a case system, support work becomes fragmented across logs, chat, and ad hoc notes.

---

## Recommended Stack

- `platform-service` case module
- Cosmos or PostgreSQL for case records
- Blob storage for attachments and debug packs
- Notification hooks to Slack/Telegram/email

### Recommendation

This can live comfortably in `platform-service`. If the case domain becomes highly
relational, PostgreSQL is better. Otherwise a Cosmos-backed module is acceptable.

---

## Phase 1 - Core Case Model

- [ ] Create modules:
  - [ ] `cases`
  - [ ] `case-comments`
  - [ ] `case-attachments`
  - [ ] `case-links`
- [ ] Track:
  - [ ] customer or workspace
  - [ ] severity
  - [ ] status
  - [ ] assignee
  - [ ] linked run IDs
  - [ ] linked diagnostics sessions
  - [ ] linked incidents and releases

---

## Phase 2 - Operational Workflow

- [ ] Add triage statuses and SLA timers
- [ ] Add handoff between support, engineering, and operations
- [ ] Add debug-pack ingestion
- [ ] Add incident and case cross-links
- [ ] Add case templates for common issue categories

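
As one way to picture triage statuses and SLA timers, the sketch below derives a response deadline from case severity. The severity levels, status names, and hour values are placeholders chosen for illustration. The roadmap does not define them, and real targets would come from support policy.

```typescript
// Placeholder enums; the roadmap does not fix severity or status values.
type Severity = "low" | "medium" | "high" | "critical";
type CaseStatus = "new" | "triaged" | "in_progress" | "resolved";

// Illustrative SLA targets in hours, not agreed policy.
const SLA_HOURS: Record<Severity, number> = {
  critical: 1,
  high: 4,
  medium: 24,
  low: 72,
};

interface SupportCase {
  id: string;
  workspaceId: string;
  severity: Severity;
  status: CaseStatus;
  assignee?: string;
  openedAt: Date;
  linkedRunIds: string[];
}

// Deadline is the open time plus the SLA window for the case severity.
function slaDeadline(c: SupportCase): Date {
  const windowMs = SLA_HOURS[c.severity] * 60 * 60 * 1000;
  return new Date(c.openedAt.getTime() + windowMs);
}

// A resolved case is never reported as breached in this simplified model.
function isSlaBreached(c: SupportCase, now: Date): boolean {
  return c.status !== "resolved" && now.getTime() > slaDeadline(c).getTime();
}
```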
---

## Phase 3 - Agent Integration

- [ ] Let agents create draft cases from failed or escalated runs
- [ ] Let support operators ask MCP tools for case-linked diagnostics
- [ ] Add case summarization and next-step suggestions

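
The first item, drafting a case from a failed or escalated run, could be as small as a pure mapping function. The sketch below assumes hypothetical `AgentRun` and `DraftCase` shapes. The actual run record belongs to the agent runtime roadmap and may look different.

```typescript
// Hypothetical run summary; the real shape is owned by the agent runtime.
interface AgentRun {
  id: string;
  workspaceId: string;
  agentName: string;
  outcome: "succeeded" | "failed" | "escalated";
  errorMessage?: string;
}

// A draft is not yet a tracked case; a human confirms or discards it.
interface DraftCase {
  title: string;
  workspaceId: string;
  status: "draft";
  description: string;
  linkedRunIds: string[];
}

// Returns null for successful runs so callers can skip case creation.
function draftCaseFromRun(run: AgentRun): DraftCase | null {
  if (run.outcome === "succeeded") {
    return null;
  }
  return {
    title: `${run.agentName} run ${run.outcome}`,
    workspaceId: run.workspaceId,
    status: "draft",
    description: run.errorMessage ?? "No error message recorded.",
    linkedRunIds: [run.id],
  };
}
```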
---

## Tech Stack Options

| Option                      | Pros                          | Cons                            | Fit                   |
| --------------------------- | ----------------------------- | ------------------------------- | --------------------- |
| Platform-native case module | Full control, integrates well | More work up front              | Best long-term        |
| External helpdesk sync      | Faster bootstrap              | Split system of record          | Good only if required |
| Ticket tool only            | Lowest build effort           | Weak agent-platform integration | Poor strategic fit    |

---

## Risks

- No unified case object means poor support analytics and weak escalations
- External-only support systems hide key agent and diagnostics context
- If cases cannot link to runs and review queues, operators lose causal context