docs(roadmaps): add agent platform gap roadmap set

root 2026-03-14 14:34:08 +00:00
parent 8ad3e1be34
commit d4c725a29d
11 changed files with 1129 additions and 0 deletions


@ -0,0 +1,109 @@
# Agent Platform Gaps - Roadmap Index
> **Purpose:** Turn the current agent-company platform gaps into an actionable roadmap set.
>
> **Scope:** `learning_ai_common_plat`
>
> **Date:** 2026-03-14
>
> **Status:** Planned
---
## Executive Summary
The shared platform already covers a large amount of generic SaaS infrastructure:
auth, telemetry, diagnostics, flags, delivery, jobs, marketplace, billing-related
modules, extraction, MCP tooling, and durable queue primitives.
What is still missing is the **agent control plane**:
1. durable agent run orchestration
2. org/workspace/team/RBAC
3. agent registry and prompt versioning
4. reusable knowledge/RAG
5. human review and approval queue
6. support case management
7. durable cross-service eventing and worker runtime
8. centralized AI governance and evals
9. AI budget and cost governance
10. enterprise provisioning and SCIM
This roadmap set breaks those gaps into separate implementation documents so they
can be sequenced without mixing concerns.
---
## Roadmap Set
1. [Agent Runtime & Orchestration Roadmap](./platform_AGENT_RUNTIME_ORCHESTRATION_ROADMAP.md)
2. [Org, Workspace & RBAC Roadmap](./platform_ORG_WORKSPACE_RBAC_ROADMAP.md)
3. [Agent Registry & Prompt Versioning Roadmap](./platform_AGENT_REGISTRY_PROMPT_VERSIONING_ROADMAP.md)
4. [Knowledge & RAG Service Roadmap](./platform_KNOWLEDGE_RAG_SERVICE_ROADMAP.md)
5. [Human Review & Approval Queue Roadmap](./platform_HUMAN_REVIEW_APPROVAL_QUEUE_ROADMAP.md)
6. [Support Case Management Roadmap](./platform_SUPPORT_CASE_MANAGEMENT_ROADMAP.md)
7. [Durable Event Bus & Worker Runtime Roadmap](./platform_DURABLE_EVENT_BUS_AND_WORKER_RUNTIME_ROADMAP.md)
8. [AI Governance & Evaluation Roadmap](./platform_AI_GOVERNANCE_EVALS_ROADMAP.md)
9. [AI Budget & Cost Governance Roadmap](./platform_AI_BUDGET_COST_GOVERNANCE_ROADMAP.md)
10. [Enterprise Provisioning & SCIM Roadmap](./platform_ENTERPRISE_PROVISIONING_SCIM_ROADMAP.md)
---
## Existing Repo Signals
These gaps are not theoretical. The current codebase already shows the partial
foundations and the missing layers:
- Durable queue primitives now exist in `packages/queue/`, but agent orchestration in
`services/mcp-server/src/modules/a2a/runner.ts` still tracks run progress primarily through logs rather than persisted state.
- `platform-service` has broad product infrastructure, but there is no first-class
org/workspace/team module under `services/platform-service/src/modules/`.
- Enterprise IdP support exists in `services/platform-service/src/modules/auth/enterprise/`,
but enterprise provisioning does not.
- The event bus in `packages/events/src/memory.ts` is in-process only.
- `ai-diagnostics` already uses embeddings and vector similarity, but there is no reusable
knowledge service for general agent retrieval.
- MFA push approvals exist, but there is no general review queue for agent actions.
---
## Recommended Build Order
### P0
1. Agent Runtime & Orchestration
2. Durable Event Bus & Worker Runtime
3. Org, Workspace & RBAC
4. Human Review & Approval Queue
### P1
5. Agent Registry & Prompt Versioning
6. Knowledge & RAG Service
7. AI Budget & Cost Governance
8. AI Governance & Evaluation
### P2
9. Enterprise Provisioning & SCIM
10. Support Case Management
---
## Architectural Guidance
These docs assume the current repo direction remains:
- TypeScript + Fastify services
- shared `@bytelyst/*` packages
- `platform-service` as control plane
- `mcp-server` as operator and A2A interface
However, some missing capabilities are more naturally relational or workflow-heavy
than the current Cosmos-first platform modules. Each roadmap therefore includes:
- a **recommended stack** for long-term quality
- a **repo-fit alternative** that stays closer to current conventions
That is intentional. The best industry-standard choice is not always the same as
the least disruptive repo-local choice.


@ -0,0 +1,94 @@
# Agent Registry & Prompt Versioning Roadmap
> **Purpose:** Create a system of record for agents, prompts, tools, versions,
> rollout states, and release governance.
>
> **Primary Surfaces:** `services/platform-service/`, `services/mcp-server/`
>
> **Status:** Planned
>
> **Estimated Effort:** 2-3 weeks
---
## Why This Is Missing
The repo has MCP tools and A2A pipelines, but it does not have a persistent registry
for the definitions that power them. Without that, agent behavior is embedded in code
and docs rather than treated as versioned platform data.
---
## Recommended Stack
- **PostgreSQL** for metadata and version history
- **Blob storage or Git-backed artifacts** for prompt files and larger assets
- **OpenTelemetry** for linking versions to production runs
### Repo-Fit Alternative
- Cosmos-backed registry module in `platform-service`
- Prompt artifacts stored in blob storage
- MCP server resolves active versions from `platform-service`
---
## Phase 1 - Core Registry
- [ ] Create modules:
  - [ ] `agent-registry`
  - [ ] `prompt-registry`
  - [ ] `tool-bundles`
- [ ] Add entities:
  - [ ] `AgentDefinition`
  - [ ] `AgentVersion`
  - [ ] `PromptTemplate`
  - [ ] `PromptVersion`
  - [ ] `ToolBundle`
  - [ ] `ReleaseChannel`
- [ ] Track:
  - [ ] owner
  - [ ] changelog
  - [ ] status: `draft`, `staged`, `active`, `deprecated`, `archived`
  - [ ] compatibility constraints

**Acceptance Criteria**

- Every production agent has a durable version record
- Prompt changes are diffable and auditable
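
The version record and status lifecycle above can be sketched as follows. This is a minimal illustration, not the planned schema: the field names on `AgentVersion`, the numeric `version` counter, and the `resolveActiveVersion` helper are all assumptions introduced for the example.

```typescript
// Status values come from the roadmap; everything else here is hypothetical.
type VersionStatus = "draft" | "staged" | "active" | "deprecated" | "archived";

interface AgentVersion {
  agentId: string;
  version: number; // assumed to increase monotonically per agent
  status: VersionStatus;
  owner: string;
  changelog: string;
  promptVersionId: string; // pins the prompt used by this agent version
}

// Returns the newest version marked `active` for an agent, if any.
function resolveActiveVersion(
  versions: readonly AgentVersion[],
  agentId: string,
): AgentVersion | undefined {
  let best: AgentVersion | undefined;
  for (const candidate of versions) {
    if (candidate.agentId !== agentId || candidate.status !== "active") continue;
    if (best === undefined || candidate.version > best.version) best = candidate;
  }
  return best;
}
```

A lookup like this is what would let `mcp-server` ask the registry which definition to run instead of relying on code-only defaults.
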
---
## Phase 2 - Rollouts & Safety
- [ ] Add staged rollouts by product, org, workspace, or cohort
- [ ] Add allowlists and freeze controls
- [ ] Add prompt approval requirement for sensitive agents
- [ ] Add rollback support
- [ ] Link agent versions to eval results and incidents
---
## Phase 3 - Runtime Integration
- [ ] `mcp-server` loads active definitions from registry rather than code-only defaults
- [ ] `agent-runs` store `agentVersion` and `promptVersion`
- [ ] Support replay against older versions for regression analysis
---
## Tech Stack Options
| Option | Pros | Cons | Fit |
| -------------------------------- | -------------------------- | ------------------------------ | --------------------------- |
| PostgreSQL + blob storage | Strong relational history | New datastore | Best long-term |
| Git as source of truth + sync DB | Great developer ergonomics | Dual-source complexity | Good for prompt-heavy teams |
| Cosmos + blob storage | Consistent with repo | Version queries less ergonomic | Good short-term |
---
## Risks
- Code-only prompt management creates invisible production drift
- Without version pinning, incident replay and audit are weak
- Registry without rollout controls is just a metadata catalog


@ -0,0 +1,146 @@
# Agent Runtime & Orchestration Roadmap
> **Purpose:** Build a durable execution layer for agent runs, step transitions,
> cancellations, retries, resumability, and operator-visible history.
>
> **Primary Surfaces:** `services/platform-service/`, `services/mcp-server/`,
> `packages/queue/`
>
> **Status:** Planned
>
> **Estimated Effort:** 3-5 weeks
---
## Why This Is Missing
The repo now has durable queue primitives, but agent execution is still not a
first-class platform service. A2A pipelines in `services/mcp-server/src/modules/a2a/`
are composed code paths rather than durable runs with persistent step state.
That is enough for prototypes. It is not enough for:
- long-running multi-step agents
- retries after process restarts
- human escalation in the middle of a run
- cancellation and pause/resume
- auditable run history
---
## Recommended Stack
### Best Long-Term Industry Standard
- **Temporal** for workflow orchestration
- **PostgreSQL** for run metadata and operator queries
- **Redis** for short-lived coordination and cache
### Best Repo-Fit Option
- `@bytelyst/queue` for durable job dispatch
- `platform-service` run records in Cosmos or datastore abstraction
- `mcp-server` as orchestration client and tool executor
### Recommendation
Start with the repo-fit option to get durable runs quickly, but design the run model
so a later move to Temporal is possible without rewriting every agent contract.
---
## Phase 1 - Canonical Run Model
- [ ] Create `services/platform-service/src/modules/agent-runs/`
- [ ] Define `AgentRunDoc`, `AgentRunStepDoc`, `AgentRunEventDoc`
- [ ] Support states: `queued`, `running`, `waiting_for_input`, `paused`, `succeeded`, `failed`, `cancelled`
- [ ] Add `parentRunId`, `workflowId`, `agentId`, `agentVersion`, `triggerSource`
- [ ] Persist step inputs, outputs, timings, error summaries, and correlation IDs
- [ ] Add APIs:
  - [ ] `POST /agent-runs`
  - [ ] `GET /agent-runs/:id`
  - [ ] `GET /agent-runs/:id/events`
  - [ ] `POST /agent-runs/:id/cancel`
  - [ ] `POST /agent-runs/:id/pause`
  - [ ] `POST /agent-runs/:id/resume`

**Acceptance Criteria**
- Every agent run has durable metadata and step history
- A run can be fetched after service restart
- Cancellation and pause are explicit states, not implicit errors
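
One way to make cancellation and pause explicit states is to model the run as a small state machine. The sketch below uses the state names and identifier fields listed in this phase; the transition table, the `updatedAt` field, and the `transitionRun` helper are assumptions made for illustration, not the planned `AgentRunDoc` definition.

```typescript
type AgentRunState =
  | "queued"
  | "running"
  | "waiting_for_input"
  | "paused"
  | "succeeded"
  | "failed"
  | "cancelled";

// Hypothetical document shape; only the identifier fields come from the roadmap.
interface AgentRunDoc {
  id: string;
  workflowId: string;
  agentId: string;
  agentVersion: string;
  triggerSource: string;
  parentRunId?: string;
  state: AgentRunState;
  updatedAt: string; // ISO-8601 timestamp
}

// Assumed transition table; terminal states have no outgoing edges.
const RUN_TRANSITIONS: Record<AgentRunState, readonly AgentRunState[]> = {
  queued: ["running", "cancelled"],
  running: ["waiting_for_input", "paused", "succeeded", "failed", "cancelled"],
  waiting_for_input: ["running", "cancelled"],
  paused: ["running", "cancelled"],
  succeeded: [],
  failed: [],
  cancelled: [],
};

function canTransition(from: AgentRunState, to: AgentRunState): boolean {
  return RUN_TRANSITIONS[from].some((allowed) => allowed === to);
}

// Returns a new document; the input run is left unchanged.
function transitionRun(run: AgentRunDoc, to: AgentRunState, now: Date): AgentRunDoc {
  if (!canTransition(run.state, to)) {
    throw new Error(`Illegal run transition: ${run.state} -> ${to}`);
  }
  return { ...run, state: to, updatedAt: now.toISOString() };
}
```

With a table like this, `POST /agent-runs/:id/pause` on a finished run can fail with a clear error instead of silently corrupting history.
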
---
## Phase 2 - Queue-Backed Execution
- [ ] Add `agent.run.execute` queue type on top of `@bytelyst/queue`
- [ ] Add per-step retries with backoff
- [ ] Add lease heartbeat for long-running steps
- [ ] Add idempotency keys for replays
- [ ] Add dead-letter handling and operator inspection
- [ ] Record structured run events for step started/completed/failed/retried
**Acceptance Criteria**
- In-flight runs survive worker restart
- Retried steps do not duplicate side effects when idempotency is configured
- Dead-lettered runs are queryable and replayable
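
The retry and idempotency items above can be illustrated with two small pieces: a capped exponential backoff and an executor that skips a side effect when its idempotency key has already completed. This is a sketch under stated assumptions. It does not use the `@bytelyst/queue` API, the default delays are arbitrary, and the in-memory store stands in for what would need to be durable storage.

```typescript
// Capped exponential backoff. `attempt` is 1-based: 1 -> base, 2 -> 2x, 3 -> 4x.
// The default values are illustrative, not recommendations.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt - 1));
}

interface StepResult {
  output: string;
}

// Hypothetical executor; a real one would persist completed keys durably.
class IdempotentStepExecutor {
  private completed: Record<string, StepResult> = {};

  // Runs `effect` at most once per idempotency key and replays the stored
  // result on later attempts with the same key.
  execute(idempotencyKey: string, effect: () => StepResult): StepResult {
    const previous = this.completed[idempotencyKey];
    if (previous !== undefined) return previous;
    const result = effect();
    this.completed[idempotencyKey] = result;
    return result;
  }
}
```

A key such as `runId:stepId` (an assumed convention) lets a retried step return the stored result instead of sending the same message twice.
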
---
## Phase 3 - A2A Integration
- [ ] Replace direct A2A pipeline progression in `mcp-server` with run orchestration APIs
- [ ] Make every pipeline step emit durable run events
- [ ] Support handoff to human review queue
- [ ] Support child runs for delegated agent tasks
- [ ] Add run-level audit links to diagnostics, telemetry, and support systems
**Acceptance Criteria**
- `mcp-server` no longer owns the durable run state itself
- A2A pipelines are observable step-by-step
- Human review can pause and later resume a run
---
## Phase 4 - Operator Experience
- [ ] Admin UI for runs, filters, and replay
- [ ] Timeline view per run
- [ ] Step diff view for prompt/tool transitions
- [ ] Cancel/retry/replay controls
- [ ] SLOs: success rate, mean run duration, retry rate, dead-letter count
---
## Tech Stack Options
| Option | Pros | Cons | Fit |
| ---------------------------- | ------------------------------------------------- | --------------------------------- | ----------------------- |
| Temporal | Best workflow semantics, retries, signals, timers | New infra, steeper learning curve | Best long-term |
| BullMQ + Redis + run DB | Simple, common in Node | Workflow semantics are custom | Strong practical option |
| `@bytelyst/queue` + run docs | Lowest disruption to repo | More framework logic to build | Best immediate path |
---
## Risks
- Custom orchestration can become a weak in-house Temporal clone if not scoped tightly
- If step contracts are not versioned, replay becomes unsafe
- If all state remains in logs, operator tooling will never be reliable
---
## Recommendation
Implement the v1 run system inside `platform-service` using `@bytelyst/queue`, but
borrow Temporal-style concepts from day one:
- workflow ID
- run ID
- signals
- child runs
- durable timers
- explicit waiting states


@ -0,0 +1,94 @@
# AI Budget & Cost Governance Roadmap
> **Purpose:** Add per-tenant and per-agent controls for model spend, quotas,
> budgets, alerts, and invoiceable AI usage.
>
> **Primary Surface:** `services/platform-service/`
>
> **Status:** Planned
>
> **Estimated Effort:** 2-3 weeks
---
## Why This Is Missing
Model usage is often more volatile than standard API usage. Agent companies need
controls for:
- daily and monthly spend caps
- per-workspace or per-agent budgets
- model allowlists and deny rules
- burst protection
- usage attribution for billing
The repo has usage and billing modules, but not a dedicated AI cost governance layer.
---
## Recommended Stack
- `platform-service` cost governance module
- usage ledger in Cosmos or PostgreSQL
- provider-specific pricing tables
- alerting through Slack, Telegram, and email
### Recommendation
This fits naturally in `platform-service`. The key is not the datastore; it is the
quality of attribution and enforcement.
---
## Phase 1 - Usage Ledger
- [ ] Create modules:
  - [ ] `ai-usage`
  - [ ] `ai-budgets`
  - [ ] `ai-pricing`
- [ ] Store:
  - [ ] tenant
  - [ ] workspace
  - [ ] agent
  - [ ] provider
  - [ ] model
  - [ ] tokens or units
  - [ ] cost estimate
  - [ ] request correlation ID
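
A ledger entry and its cost estimate could look roughly like this. The record fields mirror the list above; the per-million-token pricing shape and the `estimateCostUsd` helper are assumptions, and no real provider prices are implied.

```typescript
// Hypothetical pricing row; actual provider pricing models vary.
interface ModelPrice {
  provider: string;
  model: string;
  inputUsdPerMillionTokens: number;
  outputUsdPerMillionTokens: number;
}

// Hypothetical ledger record built from the fields listed in this phase.
interface AiUsageRecord {
  tenantId: string;
  workspaceId: string;
  agentId: string;
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  costEstimateUsd: number;
  correlationId: string;
}

// Estimates cost from token counts and a pricing row.
function estimateCostUsd(price: ModelPrice, inputTokens: number, outputTokens: number): number {
  const inputCost = inputTokens * price.inputUsdPerMillionTokens;
  const outputCost = outputTokens * price.outputUsdPerMillionTokens;
  return (inputCost + outputCost) / 1e6;
}
```

Storing the estimate on the record at write time keeps historical reports stable when pricing tables change later.
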
---
## Phase 2 - Enforcement
- [ ] Preflight budget checks before expensive calls
- [ ] Rate and spend throttles
- [ ] Model allowlists by tenant
- [ ] Degrade-to-cheaper-model policy
- [ ] Hard cap vs soft cap behavior
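
A preflight check that distinguishes soft and hard caps might be sketched as below. Mapping the soft cap to a "degrade to a cheaper model" decision is one possible reading of the items above, and every name here (`BudgetPolicy`, `BudgetDecision`, `preflightBudgetCheck`) is hypothetical.

```typescript
interface BudgetPolicy {
  softCapUsd: number; // crossing this triggers degradation (assumed behavior)
  hardCapUsd: number; // crossing this blocks the call
}

type BudgetDecision =
  | { action: "allow" }
  | { action: "degrade"; reason: string }
  | { action: "block"; reason: string };

// Decides before the model call, using spend so far plus the estimated cost.
function preflightBudgetCheck(
  policy: BudgetPolicy,
  spentUsd: number,
  estimatedCallUsd: number,
): BudgetDecision {
  const projectedUsd = spentUsd + estimatedCallUsd;
  if (projectedUsd > policy.hardCapUsd) {
    return { action: "block", reason: "projected spend exceeds hard cap" };
  }
  if (projectedUsd > policy.softCapUsd) {
    return { action: "degrade", reason: "projected spend exceeds soft cap" };
  }
  return { action: "allow" };
}
```

Returning a decision object rather than a boolean leaves room for the caller to pick a cheaper model or surface the reason to an operator.
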
---
## Phase 3 - Visibility
- [ ] Admin reports by tenant, agent, provider, and model
- [ ] Budget burn-down alerts
- [ ] Anomaly detection for spend spikes
- [ ] Export for finance and customer invoicing
---
## Tech Stack Options
| Option | Pros | Cons | Fit |
| --------------------------------------- | --------------------------------- | ----------------------------- | ------------------------- |
| Platform-native ledger + pricing tables | Full control and tenant awareness | Requires pricing upkeep | Best fit |
| External spend tool only | Fast bootstrap | Weak product attribution | Limited |
| Billing-module extension only | Less module sprawl | AI-specific logic gets buried | Acceptable but less clear |
---
## Risks
- Without spend controls, one bad prompt or loop can create material cost
- Without tenant attribution, enterprise billing becomes unreliable
- Without enforcement, dashboards become retrospective only


@ -0,0 +1,92 @@
# AI Governance & Evaluation Roadmap
> **Purpose:** Centralize evals, policy enforcement, safety review, release gates,
> and regression tracking for prompts, agents, and model behavior.
>
> **Primary Surfaces:** `services/platform-service/`, `services/extraction-service/`,
> `services/mcp-server/`
>
> **Status:** Planned
>
> **Estimated Effort:** 3-5 weeks
---
## Why This Is Missing
The repo has useful pieces:
- extraction evals
- telemetry and diagnostics
- flags and experiments
What it does not have is a central AI governance surface that answers:
- which prompts are approved
- which eval suite a release passed
- what policies apply to a class of agents
- what changed after a model or prompt update
---
## Recommended Stack
- `platform-service` governance modules
- OpenTelemetry for trace-linked evidence
- Promptfoo or a similar eval harness for offline regression
- policy layer using code-first rules first, with optional Cedar or OPA later
---
## Phase 1 - Eval Registry
- [ ] Create modules:
  - [ ] `ai-evals`
  - [ ] `ai-policies`
  - [ ] `ai-releases`
- [ ] Add entities:
  - [ ] benchmark set
  - [ ] eval run
  - [ ] eval result
  - [ ] policy decision
  - [ ] release gate
---
## Phase 2 - Policy Engine
- [ ] Add policy checks for:
  - [ ] allowed models
  - [ ] max temperature
  - [ ] blocked tools
  - [ ] required human review
  - [ ] tenant-specific restrictions
- [ ] Add release gates based on eval thresholds
- [ ] Add regression detection on prompt or model changes
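
The "code-first rules" approach could start as a plain function that checks a call against a policy object and returns a decision with its violations. The sketch below covers allowed models, max temperature, blocked tools, and required human review; the type and function names are assumptions, and tenant-specific restrictions would plausibly be handled by selecting a different policy per tenant before calling it.

```typescript
// Hypothetical policy shape covering the checks listed above.
interface AiPolicy {
  allowedModels: readonly string[];
  maxTemperature: number;
  blockedTools: readonly string[];
  requireHumanReview: boolean;
}

interface AgentCall {
  model: string;
  temperature: number;
  tools: readonly string[];
}

interface PolicyDecision {
  allowed: boolean;
  requiresHumanReview: boolean;
  violations: string[]; // kept so the decision can be audit logged
}

function evaluatePolicy(policy: AiPolicy, call: AgentCall): PolicyDecision {
  const violations: string[] = [];
  if (policy.allowedModels.indexOf(call.model) === -1) {
    violations.push(`model not allowed: ${call.model}`);
  }
  if (call.temperature > policy.maxTemperature) {
    violations.push(`temperature ${call.temperature} exceeds max ${policy.maxTemperature}`);
  }
  for (const tool of call.tools) {
    if (policy.blockedTools.indexOf(tool) !== -1) {
      violations.push(`tool blocked: ${tool}`);
    }
  }
  return {
    allowed: violations.length === 0,
    requiresHumanReview: policy.requireHumanReview,
    violations,
  };
}
```

Because the decision is plain data, the same shape could later be produced by Cedar or OPA without changing callers.
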
---
## Phase 3 - Operational Governance
- [ ] Link agent and prompt versions to eval runs
- [ ] Add incident-driven rollback recommendations
- [ ] Add policy override audit logs
- [ ] Add dashboards for pass rate, drift, and blocked releases
---
## Tech Stack Options
| Option | Pros | Cons | Fit |
| ----------------------------- | -------------------------- | ------------------------ | ------------------- |
| Promptfoo + platform registry | Good current ecosystem fit | Need custom service glue | Best near-term |
| Custom eval runner only | Full control | Reinvents too much | Weak starting point |
| OPA/Cedar-backed governance | Strong policy model | More complexity | Good phase 2+ |
---
## Risks
- Shipping prompts without eval gating causes avoidable regressions
- Governance only in docs will drift from runtime
- No policy audit trail creates enterprise trust problems


@ -0,0 +1,94 @@
# Durable Event Bus & Worker Runtime Roadmap
> **Purpose:** Replace in-process eventing and scattered background execution
> with a durable cross-service event and worker backbone.
>
> **Primary Surfaces:** `packages/events/`, `packages/queue/`, `services/platform-service/`
>
> **Status:** Planned
>
> **Estimated Effort:** 3-4 weeks
---
## Why This Is Missing
`packages/events/src/memory.ts` is an in-process event bus. That is useful for local
dispatch inside one process, but it is not enough for:
- cross-service subscriptions
- replay
- dead-letter handling
- durable delivery
- delayed fan-out
The new `@bytelyst/queue` package improves durable background work, but the eventing
layer is still incomplete.
---
## Recommended Stack
### Best Long-Term Industry Standard
- **Redis Streams** or **NATS JetStream** for durable event delivery
- `@bytelyst/queue` or BullMQ for work execution
- OpenTelemetry for trace correlation
### Repo-Fit Option
- Add a durable adapter to `@bytelyst/events`
- Use Redis-backed delivery first
- Keep current memory bus as test/dev adapter
### Recommendation
Use `@bytelyst/events` as the interface, but add a durable Redis or NATS adapter.
Do not let direct in-memory emitters remain the production default for critical flows.
---
## Phase 1 - Event Abstraction
- [ ] Extend `@bytelyst/events` to support pluggable backends
- [ ] Keep `memory` for tests
- [ ] Add `redis-streams` or `jetstream` adapter
- [ ] Add consumer groups, ack, retry, and dead-letter support
- [ ] Add correlation and causation IDs
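
A pluggable backend with correlation and causation IDs might take a shape like the one below. This is not the current `@bytelyst/events` API; `PlatformEvent`, `EventBackend`, `MemoryBackend`, and `follow` are hypothetical names. The interface is synchronous to keep the sketch short, whereas a Redis Streams or JetStream adapter would need asynchronous publish, acknowledgement, and retry.

```typescript
interface PlatformEvent<T> {
  id: string;
  type: string;
  payload: T;
  correlationId: string; // shared by every event in one logical flow
  causationId?: string; // id of the event that directly caused this one
}

type EventHandler = (event: PlatformEvent<unknown>) => void;

// The seam a durable adapter would implement (assumed interface).
interface EventBackend {
  publish(event: PlatformEvent<unknown>): void;
  subscribe(type: string, handler: EventHandler): void;
}

// In-process adapter, suitable only for tests and local development.
class MemoryBackend implements EventBackend {
  private handlers: Record<string, EventHandler[]> = {};

  publish(event: PlatformEvent<unknown>): void {
    for (const handler of this.handlers[event.type] || []) handler(event);
  }

  subscribe(type: string, handler: EventHandler): void {
    const existing = this.handlers[type] || [];
    existing.push(handler);
    this.handlers[type] = existing;
  }
}

// Builds a follow-up event that inherits the correlation ID of its cause.
function follow<T>(
  cause: PlatformEvent<unknown>,
  id: string,
  type: string,
  payload: T,
): PlatformEvent<T> {
  return { id, type, payload, correlationId: cause.correlationId, causationId: cause.id };
}
```

Keeping callers on the `EventBackend` interface is what would allow the production default to change without touching publishers or subscribers.
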
---
## Phase 2 - Worker Runtime
- [ ] Standardize worker bootstrap pattern
- [ ] Add handler registration, concurrency controls, leases, and health endpoints
- [ ] Add poison-message and dead-letter inspection
- [ ] Add scheduling and delayed dispatch
---
## Phase 3 - Service Migration
- [ ] Move delivery subscribers onto durable events
- [ ] Move auth side effects off fire-and-forget local emitters
- [ ] Move MCP/A2A transitions onto durable events where appropriate
- [ ] Add observability for event lag and failure rate
---
## Tech Stack Options
| Option | Pros | Cons | Fit |
| --------------- | ----------------------------------------------- | ------------------------------ | ------------------- |
| NATS JetStream | Strong event semantics, lightweight | New infra and integration work | Excellent long-term |
| Redis Streams | Familiar, easy to adopt with BullMQ-style stack | Less specialized than NATS | Best pragmatic path |
| Kafka | Powerful at scale | Heavy operational footprint | Overkill now |
| Memory bus only | Simple | Not durable | Dev/test only |
---
## Risks
- In-process events hide failures and block cross-service reliability
- Durable queues without durable events still leave side effects fragile
- Multiple custom worker patterns will drift without a standard runtime


@ -0,0 +1,81 @@
# Enterprise Provisioning & SCIM Roadmap
> **Purpose:** Extend enterprise identity from federation-only to full lifecycle
> provisioning, deprovisioning, group sync, and seat governance.
>
> **Primary Surface:** `services/platform-service/src/modules/auth/enterprise/`
>
> **Status:** Planned
>
> **Estimated Effort:** 2-3 weeks
---
## Why This Is Missing
The platform already has enterprise SAML and OIDC federation. That solves login.
It does not solve enterprise lifecycle management:
- just-in-time user provisioning policies
- SCIM user sync
- group sync
- deprovisioning
- seat and entitlement mapping
---
## Recommended Stack
- Extend `platform-service` enterprise auth
- SCIM 2.0 endpoints in Fastify
- org/workspace mapping from the tenant model
- optional background sync jobs using `@bytelyst/queue`
---
## Phase 1 - SCIM Foundations
- [ ] Add SCIM service provider config endpoint
- [ ] Add SCIM resource schemas
- [ ] Add endpoints and operations for:
  - [ ] `/scim/v2/Users`
  - [ ] `/scim/v2/Groups`
  - [ ] PATCH operations on users and groups
  - [ ] user deactivation (setting `active` to `false`)
- [ ] Add enterprise API tokens and audit logs
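
Deactivation in SCIM 2.0 is typically expressed as a PATCH request that replaces the `active` attribute. The sketch below applies such a request to a simplified user record. The PatchOp schema URN comes from the SCIM specification; the `ScimUser` shape and the `applyUserPatch` helper are assumptions, and the sketch deliberately handles only the `active` attribute.

```typescript
const SCIM_PATCH_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp";

interface ScimPatchOperation {
  op: string; // compared case-insensitively below to tolerate "Replace"
  path?: string;
  value?: unknown;
}

interface ScimPatchRequest {
  schemas: string[];
  Operations: ScimPatchOperation[];
}

// Simplified user record for illustration; real SCIM users carry more attributes.
interface ScimUser {
  id: string;
  userName: string;
  active: boolean;
}

// Applies a PatchOp request; anything other than replacing `active` is rejected.
function applyUserPatch(user: ScimUser, patch: ScimPatchRequest): ScimUser {
  if (patch.schemas.indexOf(SCIM_PATCH_SCHEMA) === -1) {
    throw new Error("Not a SCIM PatchOp request");
  }
  let next: ScimUser = { ...user };
  for (const operation of patch.Operations) {
    const isReplace = operation.op.toLowerCase() === "replace";
    if (isReplace && operation.path === "active" && typeof operation.value === "boolean") {
      next = { ...next, active: operation.value };
    } else {
      throw new Error(`Unsupported operation in this sketch: ${operation.op} ${operation.path || ""}`);
    }
  }
  return next;
}
```

A production handler would also need to cover the path-less form of PATCH and the other attributes that identity providers send.
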
---
## Phase 2 - Provisioning Rules
- [ ] Map SCIM users to org/workspace memberships
- [ ] Map groups to roles or teams
- [ ] Support seat assignment and revocation
- [ ] Add deprovision grace policy
---
## Phase 3 - Admin Controls
- [ ] Admin UI for provisioning state and sync errors
- [ ] Reconciliation jobs
- [ ] Audit exports
- [ ] Break-glass override flows
---
## Tech Stack Options
| Option | Pros | Cons | Fit |
| ------------------------------- | ----------------------------------- | ------------------------------------ | --------------------------------- |
| Native SCIM in platform-service | Full control, strong enterprise fit | Must implement spec carefully | Best long-term |
| IdP proxy product | Faster setup | External dependency and less control | Acceptable only if needed quickly |
| JIT only | Minimal effort | Not enough for enterprise IT | Inadequate |
---
## Risks
- Enterprise login without enterprise provisioning still creates admin pain
- Group mapping drift leads to incorrect access
- Deprovision lag is a real security risk


@ -0,0 +1,93 @@
# Human Review & Approval Queue Roadmap
> **Purpose:** Add a generic human-in-the-loop system for agent actions,
> escalations, approvals, and quality review.
>
> **Primary Surface:** `services/platform-service/`
>
> **Status:** Planned
>
> **Estimated Effort:** 2-3 weeks
---
## Why This Is Missing
The platform has MFA push approvals, but that is a narrow auth flow. An agent company
also needs a generic review queue for cases like:
- send this message
- execute this external action
- publish this recommendation
- approve this prompt change
- inspect low-confidence output
---
## Recommended Stack
- `platform-service` review module
- `@bytelyst/queue` for routing and escalation timers
- Slack and Telegram delivery adapters for reviewer notifications
- Optional policy engine later with OpenFGA or Cedar
---
## Phase 1 - Review Objects
- [ ] Create modules:
  - [ ] `reviews`
  - [ ] `approvals`
  - [ ] `escalations`
- [ ] Define review object fields:
  - [ ] subject type
  - [ ] subject ID
  - [ ] review reason
  - [ ] risk level
  - [ ] required decision type
  - [ ] assigned reviewer(s)
  - [ ] SLA and due time
  - [ ] supporting evidence
- [ ] Add states:
  - [ ] pending
  - [ ] claimed
  - [ ] approved
  - [ ] rejected
  - [ ] expired
  - [ ] superseded
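
The review states above imply a small lifecycle: a pending item is claimed, the claiming reviewer decides, and open items expire when overdue. A minimal sketch follows. The state names come from the list; the `ReviewItem` fields shown are a subset, and the rule that only the claiming reviewer may decide is an assumption made for the example.

```typescript
type ReviewState = "pending" | "claimed" | "approved" | "rejected" | "expired" | "superseded";

// Hypothetical subset of the review object fields listed above.
interface ReviewItem {
  id: string;
  subjectType: string;
  subjectId: string;
  reason: string;
  riskLevel: "low" | "medium" | "high";
  state: ReviewState;
  claimedBy?: string;
  dueAt: string; // ISO-8601 timestamp
}

function claimReview(item: ReviewItem, reviewerId: string): ReviewItem {
  if (item.state !== "pending") {
    throw new Error(`Cannot claim a review in state ${item.state}`);
  }
  return { ...item, state: "claimed", claimedBy: reviewerId };
}

// Assumed rule: only the reviewer who claimed the item may decide it.
function decideReview(
  item: ReviewItem,
  reviewerId: string,
  decision: "approved" | "rejected",
): ReviewItem {
  if (item.state !== "claimed" || item.claimedBy !== reviewerId) {
    throw new Error("Only the claiming reviewer can decide a claimed review");
  }
  return { ...item, state: decision };
}

// Marks still-open items as expired once their due time has passed.
function expireIfOverdue(item: ReviewItem, now: Date): ReviewItem {
  const isOpen = item.state === "pending" || item.state === "claimed";
  if (isOpen && Date.parse(item.dueAt) <= now.getTime()) {
    return { ...item, state: "expired" };
  }
  return item;
}
```

Escalation timers could call something like `expireIfOverdue` and then reassign, which keeps ownership and SLA handling in one place.
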
---
## Phase 2 - Workflow Integration
- [ ] Allow agent runs to emit `waiting_for_review` (either as a new run state or as a reason recorded on the run model's `waiting_for_input` state)
- [ ] Add review decision callbacks to resume or cancel runs
- [ ] Add escalation timers and reassignment
- [ ] Add reviewer comments and audit trail
---
## Phase 3 - Reviewer Experience
- [ ] API and admin UI queue
- [ ] Bulk claim and assignment
- [ ] Notification fan-out via Slack/Telegram/email
- [ ] Filters by risk, workspace, agent, age, reviewer
---
## Tech Stack Options
| Option | Pros | Cons | Fit |
| --------------------------------------- | -------------------------- | ---------------------------------------------- | ------------------- |
| Platform module + queue + notifications | Simple and aligned to repo | More UI to build | Best immediate path |
| Commercial ticketing/workflow tool | Fast start | External dependency and poor control-plane fit | Poor long-term |
| Dedicated BPM engine | Powerful | Too heavy for initial need | Overkill initially |
---
## Risks
- If approvals are only implemented ad hoc per module, policy becomes inconsistent
- If decisions are not audit logged, enterprise trust will be weak
- Review queues without SLA and ownership become dead-letter inboxes


@ -0,0 +1,117 @@
# Knowledge & RAG Service Roadmap
> **Purpose:** Build a shared knowledge platform for ingestion, chunking,
> embeddings, retrieval, citations, and access-controlled context assembly.
>
> **Primary Surfaces:** `services/platform-service/`, `services/extraction-service/`
>
> **Status:** Planned
>
> **Estimated Effort:** 4-6 weeks
---
## Why This Is Missing
The repo already has extraction and some vector-based diagnostics work, but there is
no reusable platform service for general retrieval-augmented generation across
products and agents.
Every serious agent company eventually needs:
- managed document ingestion
- chunking and metadata
- embeddings
- retrieval APIs
- citations and provenance
- workspace-aware access control
---
## Recommended Stack
### Best Long-Term Industry Standard
- **PostgreSQL + pgvector** for integrated metadata + vector search
- **Qdrant** if vector-first performance becomes dominant
- **Blob storage** for source documents
### Cloud-Native Alternative
- **Azure AI Search** for retrieval
- Cosmos or Postgres for metadata
### Recommendation
Use PostgreSQL + pgvector if you want the strongest balance of flexibility,
ownership, and industry-standard retrieval patterns. Azure AI Search is a valid
alternative if deep Azure integration matters more than datastore simplicity.
---
## Phase 1 - Knowledge Objects
- [ ] Create modules:
  - [ ] `knowledge-sources`
  - [ ] `knowledge-documents`
  - [ ] `knowledge-chunks`
  - [ ] `knowledge-indexes`
- [ ] Add ingestion states:
  - [ ] uploaded
  - [ ] parsed
  - [ ] chunked
  - [ ] embedded
  - [ ] indexed
  - [ ] failed
- [ ] Add source provenance:
  - [ ] filename
  - [ ] URL
  - [ ] connector type
  - [ ] page or section references
---
## Phase 2 - Retrieval Pipeline
- [ ] Add chunking service with configurable strategies
- [ ] Add embedding generation pipeline
- [ ] Add hybrid search:
  - [ ] lexical
  - [ ] vector
  - [ ] metadata filters
- [ ] Add citation builder and quote bounds
- [ ] Add workspace and org scoping
**Acceptance Criteria**
- Retrieval returns chunks with citations and permission-safe metadata
- Different products can share the same retrieval API
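
Hybrid search needs some way to merge lexical and vector rankings. The roadmap does not name one, so the sketch below uses reciprocal rank fusion as an assumed choice, and it scopes candidates to the caller's workspace before fusing so that out-of-workspace chunks cannot reach the results. All type and function names are hypothetical.

```typescript
interface KnowledgeChunk {
  id: string;
  workspaceId: string;
  documentId: string;
  text: string;
}

interface RetrievedChunk {
  chunkId: string;
  score: number;
  citation: { documentId: string; quote: string };
}

// Reciprocal rank fusion: each list contributes 1 / (k + rank) per item.
function fuseRankings(
  lexicalIds: readonly string[],
  vectorIds: readonly string[],
  k = 60,
): { id: string; score: number }[] {
  const scores: Record<string, number> = {};
  const accumulate = (ids: readonly string[]): void => {
    ids.forEach((id, index) => {
      scores[id] = (scores[id] || 0) + 1 / (k + index + 1);
    });
  };
  accumulate(lexicalIds);
  accumulate(vectorIds);
  return Object.keys(scores)
    .map((id) => ({ id, score: scores[id] || 0 }))
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

function retrieve(
  chunks: readonly KnowledgeChunk[],
  workspaceId: string,
  lexicalIds: readonly string[],
  vectorIds: readonly string[],
): RetrievedChunk[] {
  // Index only the chunks the caller's workspace is allowed to see.
  const visible: Record<string, KnowledgeChunk> = {};
  for (const chunk of chunks) {
    if (chunk.workspaceId === workspaceId) visible[chunk.id] = chunk;
  }
  const isVisible = (id: string): boolean => visible[id] !== undefined;
  const fused = fuseRankings(lexicalIds.filter(isVisible), vectorIds.filter(isVisible));
  const results: RetrievedChunk[] = [];
  for (const ranked of fused) {
    const chunk = visible[ranked.id];
    if (chunk === undefined) continue;
    results.push({
      chunkId: chunk.id,
      score: ranked.score,
      citation: { documentId: chunk.documentId, quote: chunk.text },
    });
  }
  return results;
}
```

In a real service the workspace filter would be pushed into the lexical and vector queries themselves; filtering here only illustrates the permission-safe contract.
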
---
## Phase 3 - Connectors
- [ ] File upload
- [ ] Web page ingestion
- [ ] Notes/workspace connector
- [ ] Blob-backed ingestion
- [ ] Optional Slack/Confluence/Google Drive connectors
---
## Tech Stack Options
| Option | Pros | Cons | Fit |
| ------------------------ | ------------------------------------------ | ---------------------------- | -------------------------------- |
| Postgres + pgvector | Strong standard, unified metadata + vector | Requires new datastore | Best overall |
| Qdrant + metadata DB | Great vector performance | Two systems to operate | Good at scale |
| Azure AI Search          | Strong managed search                      | Tighter vendor coupling      | Best Azure-managed option        |
| Cosmos vector workaround | Least disruption | Not ideal as main RAG engine | Avoid as primary long-term stack |
---
## Risks
- Retrieval without access control causes data leakage between tenants
- Retrieval without citations causes trust and compliance issues
- Embeddings without source lifecycle management become stale quickly


@ -0,0 +1,122 @@
# Org, Workspace & RBAC Roadmap
> **Purpose:** Add a first-class tenant model for organizations, workspaces,
> teams, memberships, scoped roles, and admin governance.
>
> **Primary Surface:** `services/platform-service/`
>
> **Status:** Planned
>
> **Estimated Effort:** 3-4 weeks
---
## Why This Is Missing
The platform has users and per-product memberships, but no canonical model for:
- organizations
- workspaces
- teams
- workspace-scoped roles
- resource ownership and sharing
Enterprise IdP support exists, but it does not replace a real tenant model.
---
## Recommended Stack
### Best Long-Term Industry Standard
- **PostgreSQL**
- **Drizzle ORM** or **Prisma**
- **OpenFGA** or another Zanzibar-style authorization model for fine-grained access
### Best Repo-Fit Option
- `platform-service` module set backed by Cosmos
- Role and membership evaluation in service code
- Optional policy layer later using OpenFGA
### Recommendation
If tenanting will be central to the business, PostgreSQL is the better long-term
fit because org/workspace membership is relational by nature. If short-term
consistency matters more, start in Cosmos but keep the permission model portable.
---
## Phase 1 - Data Model
- [ ] Create modules:
  - [ ] `orgs`
  - [ ] `workspaces`
  - [ ] `teams`
  - [ ] `memberships`
  - [ ] `roles`
- [ ] Define resources:
  - [ ] organization
  - [ ] workspace
  - [ ] team
  - [ ] service account
  - [ ] API key
- [ ] Define roles:
  - [ ] `org_owner`
  - [ ] `org_admin`
  - [ ] `workspace_admin`
  - [ ] `workspace_editor`
  - [ ] `workspace_viewer`
  - [ ] `support_operator`
- [ ] Add invitation and deprovision flows

**Acceptance Criteria**
- Every protected resource can be tied to org/workspace ownership
- Users can belong to multiple workspaces with different roles
---
## Phase 2 - Authorization
- [ ] Add authorization helpers to `@bytelyst/auth` or a new `@bytelyst/authorization`
- [ ] Evaluate permissions by resource and action
- [ ] Add policy checks for:
  - [ ] read
  - [ ] write
  - [ ] execute
  - [ ] approve
  - [ ] administer
- [ ] Add service account and API key scopes

**Acceptance Criteria**
- Endpoints no longer rely only on flat `admin` vs `user`
- Policies are testable and reusable across modules
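
Evaluating permissions by resource and action, rather than by a flat `admin` vs `user` flag, could start with a role-to-action table and a membership lookup. The role and action names below come from this roadmap; which actions each role grants, the `Membership` shape, and the rule that a membership without a `workspaceId` applies across the whole organization are all assumptions made for the sketch.

```typescript
type PlatformRole =
  | "org_owner"
  | "org_admin"
  | "workspace_admin"
  | "workspace_editor"
  | "workspace_viewer"
  | "support_operator";

type ResourceAction = "read" | "write" | "execute" | "approve" | "administer";

// Assumed grants per role; the real mapping is a product decision.
const ROLE_ACTIONS: Record<PlatformRole, readonly ResourceAction[]> = {
  org_owner: ["read", "write", "execute", "approve", "administer"],
  org_admin: ["read", "write", "execute", "approve", "administer"],
  workspace_admin: ["read", "write", "execute", "approve", "administer"],
  workspace_editor: ["read", "write", "execute"],
  workspace_viewer: ["read"],
  support_operator: ["read"],
};

interface Membership {
  userId: string;
  orgId: string;
  workspaceId?: string; // omitted for org-wide memberships (assumed convention)
  role: PlatformRole;
}

interface ResourceRef {
  orgId: string;
  workspaceId: string;
}

// True when any membership of the user covers the resource and grants the action.
function isAllowed(
  memberships: readonly Membership[],
  userId: string,
  resource: ResourceRef,
  action: ResourceAction,
): boolean {
  return memberships.some(
    (membership) =>
      membership.userId === userId &&
      membership.orgId === resource.orgId &&
      (membership.workspaceId === undefined || membership.workspaceId === resource.workspaceId) &&
      ROLE_ACTIONS[membership.role].indexOf(action) !== -1,
  );
}
```

A pure function like this is easy to unit test and could later be swapped for an OpenFGA check behind the same signature.
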
---
## Phase 3 - Product Integration
- [ ] Migrate existing modules that should be workspace-scoped
- [ ] Add workspace headers or explicit route scoping
- [ ] Connect enterprise IdP claims to org/workspace resolution
- [ ] Add audit entries for membership and role changes
---
## Tech Stack Options
| Option | Pros | Cons | Fit |
| --------------------------- | ----------------------------------- | -------------------------------- | ---------------------- |
| PostgreSQL + OpenFGA | Best long-term for RBAC and sharing | New datastore + auth layer | Best industry-standard |
| PostgreSQL only             | Simpler than OpenFGA, still strong  | Fine-grained auth is hand-built  | Good medium path       |
| Cosmos + service-level RBAC | Lowest disruption                   | Harder joins, less rich policies | Good short-term        |
---
## Risks
- Flat roles will become a blocker for enterprise and multi-agent collaboration
- Delaying workspace boundaries causes later data migrations
- Fine-grained sharing is hard to retrofit once data models hardcode user ownership


@ -0,0 +1,87 @@
# Support Case Management Roadmap
> **Purpose:** Build a platform-native case system for customer issues, agent
> escalations, internal triage, and resolution tracking.
>
> **Primary Surfaces:** `services/platform-service/`, `services/mcp-server/`
>
> **Status:** Planned
>
> **Estimated Effort:** 3-4 weeks
---
## Why This Is Missing
The repo has diagnostics, telemetry, debug tooling, and support-oriented MCP helpers.
What it lacks is a canonical case record that ties them together.
Without a case system, support work becomes fragmented across logs, chat, and ad hoc notes.
---
## Recommended Stack
- `platform-service` case module
- Cosmos or PostgreSQL for case records
- Blob storage for attachments and debug packs
- Notification hooks to Slack/Telegram/email
### Recommendation
This can live comfortably in `platform-service`. If the case domain becomes highly
relational, PostgreSQL is better. Otherwise a Cosmos-backed module is acceptable.
---
## Phase 1 - Core Case Model
- [ ] Create modules:
  - [ ] `cases`
  - [ ] `case-comments`
  - [ ] `case-attachments`
  - [ ] `case-links`
- [ ] Track:
  - [ ] customer or workspace
  - [ ] severity
  - [ ] status
  - [ ] assignee
  - [ ] linked run IDs
  - [ ] linked diagnostics sessions
  - [ ] linked incidents and releases
---
## Phase 2 - Operational Workflow
- [ ] Add triage statuses and SLA timers
- [ ] Add handoff between support, engineering, and operations
- [ ] Add debug-pack ingestion
- [ ] Add incident and case cross-links
- [ ] Add case templates for common issue categories
---
## Phase 3 - Agent Integration
- [ ] Let agents create draft cases from failed or escalated runs
- [ ] Let support operators ask MCP tools for case-linked diagnostics
- [ ] Add case summarization and next-step suggestions
---
## Tech Stack Options
| Option | Pros | Cons | Fit |
| --------------------------- | ----------------------------- | ------------------------------- | --------------------- |
| Platform-native case module | Full control, integrates well | More work up front | Best long-term |
| External helpdesk sync | Faster bootstrap | Split system of record | Good only if required |
| Ticket tool only | Lowest build effort | Weak agent-platform integration | Poor strategic fit |
---
## Risks
- No unified case object means poor support analytics and weak escalations
- External-only support systems hide key agent and diagnostics context
- If cases cannot link to runs and review queues, operators lose causal context