docs(roadmaps): add agent platform gap roadmap set
parent 8ad3e1be34
commit d4c725a29d
@ -0,0 +1,109 @@
# Agent Platform Gaps - Roadmap Index

> **Purpose:** Turn the current agent-company platform gaps into an actionable roadmap set.
>
> **Scope:** `learning_ai_common_plat`
>
> **Date:** 2026-03-14
>
> **Status:** Planned

---

## Executive Summary

The shared platform already covers a large amount of generic SaaS infrastructure:
auth, telemetry, diagnostics, flags, delivery, jobs, marketplace, billing-related
modules, extraction, MCP tooling, and durable queue primitives.

What is still missing is the **agent control plane**:

1. durable agent run orchestration
2. org/workspace/team/RBAC
3. agent registry and prompt versioning
4. reusable knowledge/RAG
5. human review and approval queue
6. support case management
7. durable cross-service eventing and worker runtime
8. centralized AI governance and evals
9. AI budget and cost governance
10. enterprise provisioning and SCIM

This roadmap set breaks those gaps into separate implementation documents so they
can be sequenced without mixing concerns.

---

## Roadmap Set

1. [Agent Runtime & Orchestration Roadmap](./platform_AGENT_RUNTIME_ORCHESTRATION_ROADMAP.md)
2. [Org, Workspace & RBAC Roadmap](./platform_ORG_WORKSPACE_RBAC_ROADMAP.md)
3. [Agent Registry & Prompt Versioning Roadmap](./platform_AGENT_REGISTRY_PROMPT_VERSIONING_ROADMAP.md)
4. [Knowledge & RAG Service Roadmap](./platform_KNOWLEDGE_RAG_SERVICE_ROADMAP.md)
5. [Human Review & Approval Queue Roadmap](./platform_HUMAN_REVIEW_APPROVAL_QUEUE_ROADMAP.md)
6. [Support Case Management Roadmap](./platform_SUPPORT_CASE_MANAGEMENT_ROADMAP.md)
7. [Durable Event Bus & Worker Runtime Roadmap](./platform_DURABLE_EVENT_BUS_AND_WORKER_RUNTIME_ROADMAP.md)
8. [AI Governance & Evaluation Roadmap](./platform_AI_GOVERNANCE_EVALS_ROADMAP.md)
9. [AI Budget & Cost Governance Roadmap](./platform_AI_BUDGET_COST_GOVERNANCE_ROADMAP.md)
10. [Enterprise Provisioning & SCIM Roadmap](./platform_ENTERPRISE_PROVISIONING_SCIM_ROADMAP.md)

---

## Existing Repo Signals

These gaps are not theoretical. The current codebase already shows the partial
foundations and the missing layers:

- Durable queue primitives now exist in `packages/queue/`, but agent orchestration in
  `services/mcp-server/src/modules/a2a/runner.ts` is still primarily log-driven.
- `platform-service` has broad product infrastructure, but there is no first-class
  org/workspace/team module under `services/platform-service/src/modules/`.
- Enterprise IdP support exists in `services/platform-service/src/modules/auth/enterprise/`,
  but enterprise provisioning does not.
- The event bus in `packages/events/src/memory.ts` is in-process only.
- `ai-diagnostics` already uses embeddings and vector similarity, but there is no reusable
  knowledge service for general agent retrieval.
- MFA push approvals exist, but there is no general review queue for agent actions.

---

## Recommended Build Order

### P0

1. Agent Runtime & Orchestration
2. Durable Event Bus & Worker Runtime
3. Org, Workspace & RBAC
4. Human Review & Approval Queue

### P1

5. Agent Registry & Prompt Versioning
6. Knowledge & RAG Service
7. AI Budget & Cost Governance
8. AI Governance & Evaluation

### P2

9. Enterprise Provisioning & SCIM
10. Support Case Management

---

## Architectural Guidance

These docs assume the current repo direction remains:

- TypeScript + Fastify services
- shared `@bytelyst/*` packages
- `platform-service` as control plane
- `mcp-server` as operator and A2A interface

However, some missing capabilities are more naturally relational or workflow-heavy
than the current Cosmos-first platform modules. Each roadmap therefore includes:

- a **recommended stack** for long-term quality
- a **repo-fit alternative** that stays closer to current conventions

That is intentional. The best industry-standard choice is not always the same as
the least disruptive repo-local choice.
@ -0,0 +1,94 @@
# Agent Registry & Prompt Versioning Roadmap

> **Purpose:** Create a system of record for agents, prompts, tools, versions,
> rollout states, and release governance.
>
> **Primary Surfaces:** `services/platform-service/`, `services/mcp-server/`
>
> **Status:** Planned
>
> **Estimated Effort:** 2-3 weeks

---

## Why This Is Missing

The repo has MCP tools and A2A pipelines, but it does not have a persistent registry
for the definitions that power them. Without that, agent behavior is embedded in code
and docs rather than treated as versioned platform data.

---

## Recommended Stack

- **PostgreSQL** for metadata and version history
- **Blob storage or Git-backed artifacts** for prompt files and larger assets
- **OpenTelemetry** for linking versions to production runs

### Repo-Fit Alternative

- Cosmos-backed registry module in `platform-service`
- Prompt artifacts stored in blob storage
- MCP server resolves active versions from `platform-service`

---

## Phase 1 - Core Registry

- [ ] Create modules:
  - [ ] `agent-registry`
  - [ ] `prompt-registry`
  - [ ] `tool-bundles`
- [ ] Add entities:
  - [ ] `AgentDefinition`
  - [ ] `AgentVersion`
  - [ ] `PromptTemplate`
  - [ ] `PromptVersion`
  - [ ] `ToolBundle`
  - [ ] `ReleaseChannel`
- [ ] Track:
  - [ ] owner
  - [ ] changelog
  - [ ] status: `draft`, `staged`, `active`, `deprecated`, `archived`
  - [ ] compatibility constraints

**Acceptance Criteria**

- Every production agent has a durable version record
- Prompt changes are diffable and auditable
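
The status values tracked in this phase only help if the registry rejects illegal moves between them. The sketch below shows one way to encode that as a transition table. It is illustrative only: the `AgentVersionRecord` shape, the `ALLOWED_TRANSITIONS` map, and the specific edges (including re-activating a deprecated version as a rollback path) are assumptions, not decisions recorded in this roadmap.

```typescript
// Hypothetical shapes; field names are illustrative, not an existing schema.
type VersionStatus = "draft" | "staged" | "active" | "deprecated" | "archived";

interface AgentVersionRecord {
  agentId: string;
  version: number;
  owner: string;
  changelog: string;
  status: VersionStatus;
}

// Assumed lifecycle: the status changes the registry would accept.
const ALLOWED_TRANSITIONS: Record<VersionStatus, VersionStatus[]> = {
  draft: ["staged", "archived"],
  staged: ["active", "draft"],
  active: ["deprecated"],
  deprecated: ["active", "archived"], // re-activation doubles as rollback
  archived: [],
};

// Returns a copy with the new status, or throws when the move is not allowed.
function transitionVersion(
  record: AgentVersionRecord,
  next: VersionStatus,
): AgentVersionRecord {
  if (ALLOWED_TRANSITIONS[record.status].indexOf(next) === -1) {
    throw new Error(
      `Cannot move ${record.agentId}@${record.version} from ${record.status} to ${next}`,
    );
  }
  return { ...record, status: next };
}
```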

---

## Phase 2 - Rollouts & Safety

- [ ] Add staged rollouts by product, org, workspace, or cohort
- [ ] Add allowlists and freeze controls
- [ ] Add prompt approval requirement for sensitive agents
- [ ] Add rollback support
- [ ] Link agent versions to eval results and incidents
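
Staged rollouts need a rule for which version a given caller receives. A minimal sketch follows, assuming that the most specific scope wins (workspace, then org, then product) and that a default active version exists. The `RolloutRule` and `RolloutContext` types and the precedence order are hypothetical. Cohort targeting and freeze controls are left out to keep the example short.

```typescript
// Hypothetical rollout rule: pins one agent version for a scope.
type RolloutScope = "workspace" | "org" | "product";

interface RolloutRule {
  scope: RolloutScope;
  scopeId: string;
  version: number;
}

interface RolloutContext {
  productId: string;
  orgId: string;
  workspaceId: string;
}

// Assumed precedence: the most specific scope wins.
const SCOPE_ORDER: RolloutScope[] = ["workspace", "org", "product"];

function resolveVersion(
  rules: RolloutRule[],
  ctx: RolloutContext,
  defaultVersion: number,
): number {
  const ids: Record<RolloutScope, string> = {
    workspace: ctx.workspaceId,
    org: ctx.orgId,
    product: ctx.productId,
  };
  for (let i = 0; i < SCOPE_ORDER.length; i++) {
    const scope = SCOPE_ORDER[i];
    for (let j = 0; j < rules.length; j++) {
      if (rules[j].scope === scope && rules[j].scopeId === ids[scope]) {
        return rules[j].version;
      }
    }
  }
  return defaultVersion; // no rule applies: fall back to the active default
}
```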

---

## Phase 3 - Runtime Integration

- [ ] `mcp-server` loads active definitions from registry rather than code-only defaults
- [ ] `agent-runs` store `agentVersion` and `promptVersion`
- [ ] Support replay against older versions for regression analysis

---

## Tech Stack Options

| Option                           | Pros                       | Cons                           | Fit                         |
| -------------------------------- | -------------------------- | ------------------------------ | --------------------------- |
| PostgreSQL + blob storage        | Strong relational history  | New datastore                  | Best long-term              |
| Git as source of truth + sync DB | Great developer ergonomics | Dual-source complexity         | Good for prompt-heavy teams |
| Cosmos + blob storage            | Consistent with repo       | Version queries less ergonomic | Good short-term             |

---

## Risks

- Code-only prompt management creates invisible production drift
- Without version pinning, incident replay and audit are weak
- Registry without rollout controls is just a metadata catalog
@ -0,0 +1,146 @@
# Agent Runtime & Orchestration Roadmap

> **Purpose:** Build a durable execution layer for agent runs, step transitions,
> cancellations, retries, resumability, and operator-visible history.
>
> **Primary Surfaces:** `services/platform-service/`, `services/mcp-server/`,
> `packages/queue/`
>
> **Status:** Planned
>
> **Estimated Effort:** 3-5 weeks

---

## Why This Is Missing

The repo now has durable queue primitives, but agent execution is still not a
first-class platform service. A2A pipelines in `services/mcp-server/src/modules/a2a/`
are composed code paths rather than durable runs with persistent step state.

That is enough for prototypes. It is not enough for:

- long-running multi-step agents
- retries after process restarts
- human escalation in the middle of a run
- cancellation and pause/resume
- auditable run history

---

## Recommended Stack

### Best Long-Term Industry Standard

- **Temporal** for workflow orchestration
- **PostgreSQL** for run metadata and operator queries
- **Redis** for short-lived coordination and cache

### Best Repo-Fit Option

- `@bytelyst/queue` for durable job dispatch
- `platform-service` run records in Cosmos or datastore abstraction
- `mcp-server` as orchestration client and tool executor

### Recommendation

Start with the repo-fit option to get durable runs quickly, but design the run model
so a later move to Temporal is possible without rewriting every agent contract.

---

## Phase 1 - Canonical Run Model

- [ ] Create `services/platform-service/src/modules/agent-runs/`
- [ ] Define `AgentRunDoc`, `AgentRunStepDoc`, `AgentRunEventDoc`
- [ ] Support states: `queued`, `running`, `waiting_for_input`, `paused`, `succeeded`, `failed`, `cancelled`
- [ ] Add `parentRunId`, `workflowId`, `agentId`, `agentVersion`, `triggerSource`
- [ ] Persist step inputs, outputs, timings, error summaries, and correlation IDs
- [ ] Add APIs:
  - [ ] `POST /agent-runs`
  - [ ] `GET /agent-runs/:id`
  - [ ] `GET /agent-runs/:id/events`
  - [ ] `POST /agent-runs/:id/cancel`
  - [ ] `POST /agent-runs/:id/pause`
  - [ ] `POST /agent-runs/:id/resume`

**Acceptance Criteria**

- Every agent run has durable metadata and step history
- A run can be fetched after service restart
- Cancellation and pause are explicit states, not implicit errors
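
To make "explicit states" concrete, the run model can carry a transition table that the cancel, pause, and resume endpoints consult. The following is a sketch under assumptions: the state names and the `parentRunId`, `workflowId`, `agentId`, `agentVersion`, and `triggerSource` fields come from the checklist above, while the `id` and `state` field names and every edge in `RUN_TRANSITIONS` are illustrative choices.

```typescript
// States come from the checklist above; `id` and `state` are assumed field names.
type AgentRunState =
  | "queued"
  | "running"
  | "waiting_for_input"
  | "paused"
  | "succeeded"
  | "failed"
  | "cancelled";

interface AgentRunDoc {
  id: string;
  workflowId: string;
  agentId: string;
  agentVersion: string;
  triggerSource: string;
  parentRunId?: string; // set only for child runs
  state: AgentRunState;
}

// Assumed transition table; terminal states have no outgoing edges.
const RUN_TRANSITIONS: Record<AgentRunState, AgentRunState[]> = {
  queued: ["running", "cancelled"],
  running: ["waiting_for_input", "paused", "succeeded", "failed", "cancelled"],
  waiting_for_input: ["running", "cancelled"],
  paused: ["running", "cancelled"],
  succeeded: [],
  failed: [],
  cancelled: [],
};

function canTransition(from: AgentRunState, to: AgentRunState): boolean {
  return RUN_TRANSITIONS[from].indexOf(to) !== -1;
}

// Cancellation and pause are ordinary transitions here, not error paths.
function transitionRun(run: AgentRunDoc, to: AgentRunState): AgentRunDoc {
  if (!canTransition(run.state, to)) {
    throw new Error(`Invalid run transition: ${run.state} -> ${to}`);
  }
  return { ...run, state: to };
}
```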

---

## Phase 2 - Queue-Backed Execution

- [ ] Add `agent.run.execute` queue type on top of `@bytelyst/queue`
- [ ] Add per-step retries with backoff
- [ ] Add lease heartbeat for long-running steps
- [ ] Add idempotency keys for replays
- [ ] Add dead-letter handling and operator inspection
- [ ] Record structured run events for step started/completed/failed/retried

**Acceptance Criteria**

- In-flight runs survive worker restart
- Retried steps do not duplicate side effects when idempotency is configured
- Dead-lettered runs are queryable and replayable
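
Two of the items above, retries with backoff and idempotency keys, are small enough to sketch without reference to the queue package. The example below does not use the `@bytelyst/queue` API, which is not described in this roadmap. `backoffDelayMs`, `runStepOnce`, and the in-memory `completedSteps` store are hypothetical stand-ins. A real implementation would persist completed keys durably.

```typescript
// Capped exponential backoff: delay = min(maxMs, baseMs * 2^(attempt - 1)).
// `attempt` is 1-based; the defaults are placeholders, not tuned values.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt - 1));
}

// In-memory stand-in for a durable idempotency store (hypothetical).
const completedSteps: Record<string, string> = {};

// Runs the side effect at most once per key; replays return the stored result.
function runStepOnce(idempotencyKey: string, effect: () => string): string {
  if (Object.prototype.hasOwnProperty.call(completedSteps, idempotencyKey)) {
    return completedSteps[idempotencyKey];
  }
  const result = effect();
  completedSteps[idempotencyKey] = result;
  return result;
}
```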

---

## Phase 3 - A2A Integration

- [ ] Replace direct A2A pipeline progression in `mcp-server` with run orchestration APIs
- [ ] Make every pipeline step emit durable run events
- [ ] Support handoff to human review queue
- [ ] Support child runs for delegated agent tasks
- [ ] Add run-level audit links to diagnostics, telemetry, and support systems

**Acceptance Criteria**

- `mcp-server` no longer owns the durable run state itself
- A2A pipelines are observable step-by-step
- Human review can pause and later resume a run

---

## Phase 4 - Operator Experience

- [ ] Admin UI for runs, filters, and replay
- [ ] Timeline view per run
- [ ] Step diff view for prompt/tool transitions
- [ ] Cancel/retry/replay controls
- [ ] SLOs: success rate, mean run duration, retry rate, dead-letter count

---

## Tech Stack Options

| Option                       | Pros                                              | Cons                              | Fit                     |
| ---------------------------- | ------------------------------------------------- | --------------------------------- | ----------------------- |
| Temporal                     | Best workflow semantics, retries, signals, timers | New infra, steeper learning curve | Best long-term          |
| BullMQ + Redis + run DB      | Simple, common in Node                            | Workflow semantics are custom     | Strong practical option |
| `@bytelyst/queue` + run docs | Lowest disruption to repo                         | More framework logic to build     | Best immediate path     |

---

## Risks

- Custom orchestration can become a weak in-house Temporal clone if not scoped tightly
- If step contracts are not versioned, replay becomes unsafe
- If all state remains in logs, operator tooling will never be reliable

---

## Recommendation

Implement the v1 run system inside `platform-service` using `@bytelyst/queue`, but
borrow Temporal-style concepts from day one:

- workflow ID
- run ID
- signals
- child runs
- durable timers
- explicit waiting states
@ -0,0 +1,94 @@
# AI Budget & Cost Governance Roadmap

> **Purpose:** Add per-tenant and per-agent controls for model spend, quotas,
> budgets, alerts, and invoiceable AI usage.
>
> **Primary Surface:** `services/platform-service/`
>
> **Status:** Planned
>
> **Estimated Effort:** 2-3 weeks

---

## Why This Is Missing

Model usage is often more volatile than standard API usage. Agent companies need
controls for:

- daily and monthly spend caps
- per-workspace or per-agent budgets
- model allowlists and deny rules
- burst protection
- usage attribution for billing

The repo has usage and billing modules, but not a dedicated AI cost governance layer.

---

## Recommended Stack

- `platform-service` cost governance module
- usage ledger in Cosmos or PostgreSQL
- provider-specific pricing tables
- alerting through Slack, Telegram, and email

### Recommendation

This fits naturally in `platform-service`. The key is not the datastore; it is the
quality of attribution and enforcement.

---

## Phase 1 - Usage Ledger

- [ ] Create modules:
  - [ ] `ai-usage`
  - [ ] `ai-budgets`
  - [ ] `ai-pricing`
- [ ] Store:
  - [ ] tenant
  - [ ] workspace
  - [ ] agent
  - [ ] provider
  - [ ] model
  - [ ] tokens or units
  - [ ] cost estimate
  - [ ] request correlation ID
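
A ledger entry built from the stored fields above, together with a pricing-table lookup, might look like the sketch below. The `AiUsageEntry` and `ModelPrice` shapes, the split into input and output tokens, and the per-1k-token pricing unit are assumptions. The prices are placeholders for illustration and do not reflect any real provider.

```typescript
interface ModelPrice {
  inputPer1kTokensUsd: number;
  outputPer1kTokensUsd: number;
}

// Placeholder prices for illustration only; not real provider pricing.
const PRICING: Record<string, ModelPrice> = {
  "example-provider/example-model": {
    inputPer1kTokensUsd: 0.5,
    outputPer1kTokensUsd: 1.5,
  },
};

// Hypothetical ledger row mirroring the fields listed in this phase.
interface AiUsageEntry {
  tenantId: string;
  workspaceId: string;
  agentId: string;
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  costEstimateUsd: number;
  correlationId: string;
}

// Looks up the price row and converts token counts into an estimated cost.
function estimateCostUsd(
  provider: string,
  model: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const price = PRICING[`${provider}/${model}`];
  if (!price) {
    throw new Error(`No pricing entry for ${provider}/${model}`);
  }
  return (
    (inputTokens / 1000) * price.inputPer1kTokensUsd +
    (outputTokens / 1000) * price.outputPer1kTokensUsd
  );
}
```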

---

## Phase 2 - Enforcement

- [ ] Preflight budget checks before expensive calls
- [ ] Rate and spend throttles
- [ ] Model allowlists by tenant
- [ ] Degrade-to-cheaper-model policy
- [ ] Hard cap vs soft cap behavior
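
The difference between soft and hard caps is easiest to see as a single preflight decision. The sketch below assumes that a hard cap always blocks and that a soft cap either warns or triggers the degrade-to-cheaper-model policy. The `BudgetPolicy` shape and the `BudgetDecision` values are hypothetical names chosen for this example.

```typescript
type BudgetDecision = "allow" | "allow_with_warning" | "degrade" | "block";

// Hypothetical per-tenant or per-agent budget settings.
interface BudgetPolicy {
  softCapUsd: number;
  hardCapUsd: number;
  degradeOnSoftCap: boolean; // true: switch to a cheaper model past the soft cap
}

// Decides what to do before a call, based on spend so far plus the estimated cost.
function preflightBudgetCheck(
  spentUsd: number,
  estimatedCallUsd: number,
  policy: BudgetPolicy,
): BudgetDecision {
  const projected = spentUsd + estimatedCallUsd;
  if (projected > policy.hardCapUsd) {
    return "block";
  }
  if (projected > policy.softCapUsd) {
    return policy.degradeOnSoftCap ? "degrade" : "allow_with_warning";
  }
  return "allow";
}
```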

---

## Phase 3 - Visibility

- [ ] Admin reports by tenant, agent, provider, and model
- [ ] Budget burn-down alerts
- [ ] Anomaly detection for spend spikes
- [ ] Export for finance and customer invoicing

---

## Tech Stack Options

| Option                                  | Pros                              | Cons                          | Fit                       |
| --------------------------------------- | --------------------------------- | ----------------------------- | ------------------------- |
| Platform-native ledger + pricing tables | Full control and tenant awareness | Requires pricing upkeep       | Best fit                  |
| External spend tool only                | Fast bootstrap                    | Weak product attribution      | Limited                   |
| Billing-module extension only           | Less module sprawl                | AI-specific logic gets buried | Acceptable but less clear |

---

## Risks

- Without spend controls, one bad prompt or loop can create material cost
- Without tenant attribution, enterprise billing becomes unreliable
- Without enforcement, dashboards become retrospective only
@ -0,0 +1,92 @@
# AI Governance & Evaluation Roadmap

> **Purpose:** Centralize evals, policy enforcement, safety review, release gates,
> and regression tracking for prompts, agents, and model behavior.
>
> **Primary Surfaces:** `services/platform-service/`, `services/extraction-service/`,
> `services/mcp-server/`
>
> **Status:** Planned
>
> **Estimated Effort:** 3-5 weeks

---

## Why This Is Missing

The repo has useful pieces:

- extraction evals
- telemetry and diagnostics
- flags and experiments

What it does not have is a central AI governance surface that answers:

- which prompts are approved
- which eval suite a release passed
- what policies apply to a class of agents
- what changed after a model or prompt update

---

## Recommended Stack

- `platform-service` governance modules
- OpenTelemetry for trace-linked evidence
- Promptfoo or a similar eval harness for offline regression
- policy layer using code-first rules first, with optional Cedar or OPA later

---

## Phase 1 - Eval Registry

- [ ] Create modules:
  - [ ] `ai-evals`
  - [ ] `ai-policies`
  - [ ] `ai-releases`
- [ ] Add entities:
  - [ ] benchmark set
  - [ ] eval run
  - [ ] eval result
  - [ ] policy decision
  - [ ] release gate

---

## Phase 2 - Policy Engine

- [ ] Add policy checks for:
  - [ ] allowed models
  - [ ] max temperature
  - [ ] blocked tools
  - [ ] required human review
  - [ ] tenant-specific restrictions
- [ ] Add release gates based on eval thresholds
- [ ] Add regression detection on prompt or model changes
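
A code-first policy layer, as opposed to a Cedar or OPA deployment, can start as a plain function that returns a list of violations. The sketch below covers three of the checks listed above plus a threshold-based release gate. The `AgentPolicy` and `AgentCallRequest` shapes and the function names are assumptions made for illustration.

```typescript
// Hypothetical policy attached to a class of agents.
interface AgentPolicy {
  allowedModels: string[];
  maxTemperature: number;
  blockedTools: string[];
}

// Hypothetical description of what an agent is about to do.
interface AgentCallRequest {
  model: string;
  temperature: number;
  tools: string[];
}

// Returns human-readable violations; an empty array means the call is allowed.
function evaluatePolicy(policy: AgentPolicy, req: AgentCallRequest): string[] {
  const violations: string[] = [];
  if (policy.allowedModels.indexOf(req.model) === -1) {
    violations.push(`model not allowed: ${req.model}`);
  }
  if (req.temperature > policy.maxTemperature) {
    violations.push(`temperature ${req.temperature} exceeds ${policy.maxTemperature}`);
  }
  for (let i = 0; i < req.tools.length; i++) {
    if (policy.blockedTools.indexOf(req.tools[i]) !== -1) {
      violations.push(`tool blocked: ${req.tools[i]}`);
    }
  }
  return violations;
}

// Release gate: a candidate ships only if its eval pass rate meets the threshold.
function passesReleaseGate(passRate: number, threshold: number): boolean {
  return passRate >= threshold;
}
```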

---

## Phase 3 - Operational Governance

- [ ] Link agent and prompt versions to eval runs
- [ ] Add incident-driven rollback recommendations
- [ ] Add policy override audit logs
- [ ] Add dashboards for pass rate, drift, and blocked releases

---

## Tech Stack Options

| Option                        | Pros                       | Cons                     | Fit                 |
| ----------------------------- | -------------------------- | ------------------------ | ------------------- |
| Promptfoo + platform registry | Good current ecosystem fit | Need custom service glue | Best near-term      |
| Custom eval runner only       | Full control               | Reinvents too much       | Weak starting point |
| OPA/Cedar-backed governance   | Strong policy model        | More complexity          | Good phase 2+       |

---

## Risks

- Shipping prompts without eval gating causes avoidable regressions
- Governance only in docs will drift from runtime
- No policy audit trail creates enterprise trust problems
@ -0,0 +1,94 @@
# Durable Event Bus & Worker Runtime Roadmap

> **Purpose:** Replace in-process eventing and scattered background execution
> with a durable cross-service event and worker backbone.
>
> **Primary Surfaces:** `packages/events/`, `packages/queue/`, `services/platform-service/`
>
> **Status:** Planned
>
> **Estimated Effort:** 3-4 weeks

---

## Why This Is Missing

`packages/events/src/memory.ts` is an in-process event bus. That is useful for local
dispatch inside one process, but it is not enough for:

- cross-service subscriptions
- replay
- dead-letter handling
- durable delivery
- delayed fan-out

The new `@bytelyst/queue` package improves durable background work, but the eventing
layer is still incomplete.

---

## Recommended Stack

### Best Long-Term Industry Standard

- **Redis Streams** or **NATS JetStream** for durable event delivery
- `@bytelyst/queue` or BullMQ for work execution
- OpenTelemetry for trace correlation

### Repo-Fit Option

- Add a durable adapter to `@bytelyst/events`
- Use Redis-backed delivery first
- Keep current memory bus as test/dev adapter

### Recommendation

Use `@bytelyst/events` as the interface, but add a durable Redis or NATS adapter.
Do not let direct in-memory emitters remain the production default for critical flows.

---

## Phase 1 - Event Abstraction

- [ ] Extend `@bytelyst/events` to support pluggable backends
- [ ] Keep `memory` for tests
- [ ] Add `redis-streams` or `jetstream` adapter
- [ ] Add consumer groups, ack, retry, and dead-letter support
- [ ] Add correlation and causation IDs
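
The sketch below illustrates what a pluggable backend contract with correlation and causation IDs could look like. It is not the current `@bytelyst/events` API, which this roadmap does not describe. `EventEnvelope`, `EventBackend`, `MemoryBackend`, and `followUp` are hypothetical names. A durable Redis Streams or JetStream adapter would implement the same interface and add acknowledgement and retry, which are omitted here.

```typescript
// Hypothetical envelope: every event carries tracing identifiers.
interface EventEnvelope<T> {
  id: string;
  type: string;
  payload: T;
  correlationId: string; // shared by all events in one logical flow
  causationId?: string; // id of the event that directly caused this one
}

type BusEventHandler = (event: EventEnvelope<unknown>) => void;

// Backend contract: memory and durable adapters would both implement it.
interface EventBackend {
  publish(event: EventEnvelope<unknown>): void;
  subscribe(type: string, handler: BusEventHandler): void;
}

// In-process adapter, suitable for tests and local development only.
class MemoryBackend implements EventBackend {
  private handlers: Record<string, BusEventHandler[]> = {};

  publish(event: EventEnvelope<unknown>): void {
    const list = this.handlers[event.type] || [];
    for (let i = 0; i < list.length; i++) {
      list[i](event);
    }
  }

  subscribe(type: string, handler: BusEventHandler): void {
    if (!this.handlers[type]) {
      this.handlers[type] = [];
    }
    this.handlers[type].push(handler);
  }
}

// Derives a follow-up event that keeps the correlation ID and points at its cause.
function followUp<T>(
  cause: EventEnvelope<unknown>,
  id: string,
  type: string,
  payload: T,
): EventEnvelope<T> {
  return { id, type, payload, correlationId: cause.correlationId, causationId: cause.id };
}
```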

---

## Phase 2 - Worker Runtime

- [ ] Standardize worker bootstrap pattern
- [ ] Add handler registration, concurrency controls, leases, and health endpoints
- [ ] Add poison-message and dead-letter inspection
- [ ] Add scheduling and delayed dispatch

---

## Phase 3 - Service Migration

- [ ] Move delivery subscribers onto durable events
- [ ] Move auth side effects off fire-and-forget local emitters
- [ ] Move MCP/A2A transitions onto durable events where appropriate
- [ ] Add observability for event lag and failure rate

---

## Tech Stack Options

| Option          | Pros                                            | Cons                           | Fit                 |
| --------------- | ----------------------------------------------- | ------------------------------ | ------------------- |
| NATS JetStream  | Strong event semantics, lightweight             | New infra and integration work | Excellent long-term |
| Redis Streams   | Familiar, easy to adopt with BullMQ-style stack | Less specialized than NATS     | Best pragmatic path |
| Kafka           | Powerful at scale                               | Heavy operational footprint    | Overkill now        |
| Memory bus only | Simple                                          | Not durable                    | Dev/test only       |

---

## Risks

- In-process events hide failures and block cross-service reliability
- Durable queues without durable events still leave side effects fragile
- Multiple custom worker patterns will drift without a standard runtime
@ -0,0 +1,81 @@
# Enterprise Provisioning & SCIM Roadmap

> **Purpose:** Extend enterprise identity from federation-only to full lifecycle
> provisioning, deprovisioning, group sync, and seat governance.
>
> **Primary Surface:** `services/platform-service/src/modules/auth/enterprise/`
>
> **Status:** Planned
>
> **Estimated Effort:** 2-3 weeks

---

## Why This Is Missing

The platform already has enterprise SAML and OIDC federation. That solves login.
It does not solve enterprise lifecycle management:

- just-in-time user provisioning policies
- SCIM user sync
- group sync
- deprovisioning
- seat and entitlement mapping

---

## Recommended Stack

- Extend `platform-service` enterprise auth
- SCIM 2.0 endpoints in Fastify
- org/workspace mapping from the tenant model
- optional background sync jobs using `@bytelyst/queue`

---

## Phase 1 - SCIM Foundations

- [ ] Add SCIM service provider config endpoint
- [ ] Add SCIM resource schemas
- [ ] Add endpoints for:
  - [ ] `/scim/v2/Users`
  - [ ] `/scim/v2/Groups`
  - [ ] PATCH
  - [ ] deactivate
- [ ] Add enterprise API tokens and audit logs
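
Deactivation in SCIM 2.0 is usually expressed as a PATCH that replaces the `active` attribute, using the PatchOp message format from RFC 7644. The sketch below handles only that one case. The `ProvisionedUser` record and the `applyUserPatch` function are hypothetical, and a production handler would need to support more operations and the path-less value form that some identity providers send.

```typescript
// SCIM 2.0 PatchOp message URN (RFC 7644, section 3.5.2).
const PATCH_OP_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp";

interface ScimPatchOperation {
  op: string;
  path?: string;
  value?: unknown;
}

interface ScimPatchRequest {
  schemas: string[];
  Operations: ScimPatchOperation[];
}

// Hypothetical internal user record.
interface ProvisionedUser {
  id: string;
  userName: string;
  active: boolean;
}

// Applies `replace` on `active` only; anything else is rejected in this sketch.
function applyUserPatch(user: ProvisionedUser, patch: ScimPatchRequest): ProvisionedUser {
  if (patch.schemas.indexOf(PATCH_OP_SCHEMA) === -1) {
    throw new Error("Not a SCIM PatchOp message");
  }
  const next: ProvisionedUser = { ...user };
  for (let i = 0; i < patch.Operations.length; i++) {
    const operation = patch.Operations[i];
    // Compare case-insensitively, since some clients send "Replace".
    const opName = operation.op.toLowerCase();
    if (opName === "replace" && operation.path === "active" && typeof operation.value === "boolean") {
      next.active = operation.value;
    } else {
      throw new Error(`Unsupported operation: ${operation.op} ${operation.path}`);
    }
  }
  return next;
}
```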

---

## Phase 2 - Provisioning Rules

- [ ] Map SCIM users to org/workspace memberships
- [ ] Map groups to roles or teams
- [ ] Support seat assignment and revocation
- [ ] Add deprovision grace policy

---

## Phase 3 - Admin Controls

- [ ] Admin UI for provisioning state and sync errors
- [ ] Reconciliation jobs
- [ ] Audit exports
- [ ] Break-glass override flows

---

## Tech Stack Options

| Option                          | Pros                                | Cons                                 | Fit                               |
| ------------------------------- | ----------------------------------- | ------------------------------------ | --------------------------------- |
| Native SCIM in platform-service | Full control, strong enterprise fit | Must implement spec carefully        | Best long-term                    |
| IdP proxy product               | Faster setup                        | External dependency and less control | Acceptable only if needed quickly |
| JIT only                        | Minimal effort                      | Not enough for enterprise IT         | Inadequate                        |

---

## Risks

- Enterprise login without enterprise provisioning still creates admin pain
- Group mapping drift leads to incorrect access
- Deprovision lag is a real security risk
@ -0,0 +1,93 @@
# Human Review & Approval Queue Roadmap

> **Purpose:** Add a generic human-in-the-loop system for agent actions,
> escalations, approvals, and quality review.
>
> **Primary Surface:** `services/platform-service/`
>
> **Status:** Planned
>
> **Estimated Effort:** 2-3 weeks

---

## Why This Is Missing

The platform has MFA push approvals, but that is a narrow auth flow. An agent company
also needs a generic review queue for cases like:

- send this message
- execute this external action
- publish this recommendation
- approve this prompt change
- inspect low-confidence output

---

## Recommended Stack

- `platform-service` review module
- `@bytelyst/queue` for routing and escalation timers
- Slack and Telegram delivery adapters for reviewer notifications
- Optional policy engine later with OpenFGA or Cedar

---

## Phase 1 - Review Objects

- [ ] Create modules:
  - [ ] `reviews`
  - [ ] `approvals`
  - [ ] `escalations`
- [ ] Define review object fields:
  - [ ] subject type
  - [ ] subject ID
  - [ ] review reason
  - [ ] risk level
  - [ ] required decision type
  - [ ] assigned reviewer(s)
  - [ ] SLA and due time
  - [ ] supporting evidence
- [ ] Add states:
  - [ ] pending
  - [ ] claimed
  - [ ] approved
  - [ ] rejected
  - [ ] expired
  - [ ] superseded
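
The review states above imply a small state machine. The sketch below shows one possible reading: only pending items can be claimed, only the claiming reviewer can decide, and open items past their due time expire. The `ReviewItem` shape covers a subset of the listed fields, and all function names and rules here are assumptions. The `superseded` state is declared but not exercised.

```typescript
type ReviewState =
  | "pending"
  | "claimed"
  | "approved"
  | "rejected"
  | "expired"
  | "superseded";

// Hypothetical review object with a subset of the fields listed above.
interface ReviewItem {
  id: string;
  subjectType: string;
  subjectId: string;
  riskLevel: "low" | "medium" | "high";
  state: ReviewState;
  assignedReviewer?: string;
  dueAtMs: number; // SLA deadline as epoch milliseconds
}

// A pending item can be claimed by exactly one reviewer.
function claimReview(item: ReviewItem, reviewer: string): ReviewItem {
  if (item.state !== "pending") {
    throw new Error(`Cannot claim a ${item.state} review`);
  }
  return { ...item, state: "claimed", assignedReviewer: reviewer };
}

// Only the reviewer who claimed the item may record a decision (assumed rule).
function decideReview(
  item: ReviewItem,
  reviewer: string,
  decision: "approved" | "rejected",
): ReviewItem {
  if (item.state !== "claimed" || item.assignedReviewer !== reviewer) {
    throw new Error(`Review ${item.id} is not claimed by ${reviewer}`);
  }
  return { ...item, state: decision };
}

// Open items past their due time move to `expired` so they do not sit unowned.
function expireIfOverdue(item: ReviewItem, nowMs: number): ReviewItem {
  const isOpen = item.state === "pending" || item.state === "claimed";
  if (isOpen && nowMs > item.dueAtMs) {
    const expired: ReviewItem = { ...item, state: "expired" };
    return expired;
  }
  return item;
}
```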
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 2 - Workflow Integration
|
||||||
|
|
||||||
|
- [ ] Allow agent runs to emit `waiting_for_review`
|
||||||
|
- [ ] Add review decision callbacks to resume or cancel runs
|
||||||
|
- [ ] Add escalation timers and reassignment
|
||||||
|
- [ ] Add reviewer comments and audit trail
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 3 - Reviewer Experience
|
||||||
|
|
||||||
|
- [ ] API and admin UI queue
|
||||||
|
- [ ] bulk claim and assignment
|
||||||
|
- [ ] notification fan-out via Slack/Telegram/email
|
||||||
|
- [ ] filters by risk, workspace, agent, age, reviewer
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Tech Stack Options
|
||||||
|
|
||||||
|
| Option                                  | Pros                       | Cons                                           | Fit                 |
| --------------------------------------- | -------------------------- | ---------------------------------------------- | ------------------- |
| Platform module + queue + notifications | Simple and aligned to repo | More UI to build                               | Best immediate path |
| Commercial ticketing/workflow tool      | Fast start                 | External dependency and poor control-plane fit | Poor long-term      |
| Dedicated BPM engine                    | Powerful                   | Too heavy for initial need                     | Overkill initially  |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Risks
|
||||||
|
|
||||||
|
- If approvals are only implemented ad hoc per module, policy becomes inconsistent
|
||||||
|
- If decisions are not audit logged, enterprise trust will be weak
|
||||||
|
- Review queues without SLAs and clear ownership become dead-letter inboxes
|
||||||
@ -0,0 +1,117 @@
|
|||||||
|
# Knowledge & RAG Service Roadmap
|
||||||
|
|
||||||
|
> **Purpose:** Build a shared knowledge platform for ingestion, chunking,
|
||||||
|
> embeddings, retrieval, citations, and access-controlled context assembly.
|
||||||
|
>
|
||||||
|
> **Primary Surfaces:** `services/platform-service/`, `services/extraction-service/`
|
||||||
|
>
|
||||||
|
> **Status:** Planned
|
||||||
|
>
|
||||||
|
> **Estimated Effort:** 4-6 weeks
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Why This Is Missing
|
||||||
|
|
||||||
|
The repo already has extraction and some vector-based diagnostics work, but there is
|
||||||
|
no reusable platform service for general retrieval-augmented generation across
|
||||||
|
products and agents.
|
||||||
|
|
||||||
|
Every serious agent company eventually needs:
|
||||||
|
|
||||||
|
- managed document ingestion
|
||||||
|
- chunking and metadata
|
||||||
|
- embeddings
|
||||||
|
- retrieval APIs
|
||||||
|
- citations and provenance
|
||||||
|
- workspace-aware access control
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Recommended Stack
|
||||||
|
|
||||||
|
### Best Long-Term Industry Standard
|
||||||
|
|
||||||
|
- **PostgreSQL + pgvector** for integrated metadata + vector search
|
||||||
|
- **Qdrant** if vector-first performance becomes dominant
|
||||||
|
- **Blob storage** for source documents
|
||||||
|
|
||||||
|
### Cloud-Native Alternative
|
||||||
|
|
||||||
|
- **Azure AI Search** for retrieval
|
||||||
|
- Cosmos or Postgres for metadata
|
||||||
|
|
||||||
|
### Recommendation
|
||||||
|
|
||||||
|
Use PostgreSQL + pgvector if you want the strongest balance of flexibility,
|
||||||
|
ownership, and industry-standard retrieval patterns. Azure AI Search is a valid
|
||||||
|
alternative if deep Azure integration matters more than datastore simplicity.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 1 - Knowledge Objects
|
||||||
|
|
||||||
|
- [ ] Create modules:
|
||||||
|
- [ ] `knowledge-sources`
|
||||||
|
- [ ] `knowledge-documents`
|
||||||
|
- [ ] `knowledge-chunks`
|
||||||
|
- [ ] `knowledge-indexes`
|
||||||
|
- [ ] Add ingestion states:
|
||||||
|
- [ ] uploaded
|
||||||
|
- [ ] parsed
|
||||||
|
- [ ] chunked
|
||||||
|
- [ ] embedded
|
||||||
|
- [ ] indexed
|
||||||
|
- [ ] failed
|
||||||
|
- [ ] Add source provenance:
|
||||||
|
- [ ] filename
|
||||||
|
- [ ] URL
|
||||||
|
- [ ] connector type
|
||||||
|
- [ ] page or section references
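
The ingestion states and provenance fields above could be expressed as in the following sketch. The state names come from the checklist. The linear ordering, the `Provenance` interface, the `connectorType` values, and the `nextIngestionState` helper are assumptions for illustration.

```typescript
type IngestionState =
  | "uploaded"
  | "parsed"
  | "chunked"
  | "embedded"
  | "indexed"
  | "failed";

// Assumed happy-path order; "failed" can be entered from any step.
const ingestionPipeline: IngestionState[] = [
  "uploaded",
  "parsed",
  "chunked",
  "embedded",
  "indexed",
];

// Hypothetical provenance record attached to each document or chunk.
interface Provenance {
  filename: string | null;
  url: string | null;
  connectorType: "file_upload" | "web" | "blob"; // illustrative values
  pageOrSection: string | null;
}

// Returns the next happy-path state, or null when there is none
// (the document is already indexed, or it has failed).
function nextIngestionState(current: IngestionState): IngestionState | null {
  const index = ingestionPipeline.indexOf(current);
  if (index === -1) return null;
  return ingestionPipeline[index + 1] ?? null;
}
```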
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 2 - Retrieval Pipeline
|
||||||
|
|
||||||
|
- [ ] Add chunking service with configurable strategies
|
||||||
|
- [ ] Add embedding generation pipeline
|
||||||
|
- [ ] Add hybrid search:
|
||||||
|
- [ ] lexical
|
||||||
|
- [ ] vector
|
||||||
|
- [ ] metadata filters
|
||||||
|
- [ ] Add citation builder and quote bounds
|
||||||
|
- [ ] Add workspace and org scoping
|
||||||
|
|
||||||
|
**Acceptance Criteria**
|
||||||
|
|
||||||
|
- Retrieval returns chunks with citations and permission-safe metadata
|
||||||
|
- Different products can share the same retrieval API
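
One way to combine the lexical and vector results from Phase 2 is reciprocal rank fusion (RRF), sketched below. The roadmap does not prescribe a fusion method, so RRF is a technique chosen here for illustration. The `Chunk` and `RetrievedChunk` shapes and the `hybridRetrieve` function are hypothetical, and the two input rankings are assumed to come from separate lexical and vector searches.

```typescript
// Hypothetical chunk shape; sourceUrl stands in for richer provenance.
interface Chunk {
  id: string;
  workspaceId: string;
  text: string;
  sourceUrl: string;
}

interface RetrievedChunk {
  chunkId: string;
  text: string;
  citation: string;
  score: number;
}

// Reciprocal rank fusion: each ranking contributes 1 / (k + rank),
// with rank starting at 1. k = 60 is a commonly used default.
function reciprocalRankFusion(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return scores;
}

// Fuses the two rankings, keeps only chunks in the caller's workspace,
// and attaches a citation to every result.
function hybridRetrieve(
  chunks: Chunk[],
  lexicalRanking: string[],
  vectorRanking: string[],
  workspaceId: string,
  limit: number,
): RetrievedChunk[] {
  const scores = reciprocalRankFusion([lexicalRanking, vectorRanking]);
  return chunks
    .filter((c) => c.workspaceId === workspaceId && scores.has(c.id))
    .map((c) => ({
      chunkId: c.id,
      text: c.text,
      citation: c.sourceUrl,
      score: scores.get(c.id) ?? 0,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

Applying the workspace filter inside the retrieval function, rather than in each caller, is what keeps the shared API permission-safe by default.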
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 3 - Connectors
|
||||||
|
|
||||||
|
- [ ] File upload
|
||||||
|
- [ ] Web page ingestion
|
||||||
|
- [ ] Notes/workspace connector
|
||||||
|
- [ ] Blob-backed ingestion
|
||||||
|
- [ ] Optional Slack/Confluence/Google Drive connectors
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Tech Stack Options
|
||||||
|
|
||||||
|
| Option                   | Pros                                       | Cons                         | Fit                              |
| ------------------------ | ------------------------------------------ | ---------------------------- | -------------------------------- |
| Postgres + pgvector      | Strong standard, unified metadata + vector | Requires new datastore       | Best overall                     |
| Qdrant + metadata DB     | Great vector performance                   | Two systems to operate       | Good at scale                    |
| Azure AI Search          | Strong managed search                      | Tighter vendor coupling      | Best Azure-managed option        |
| Cosmos vector workaround | Least disruption                           | Not ideal as main RAG engine | Avoid as primary long-term stack |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Risks
|
||||||
|
|
||||||
|
- Retrieval without access control causes data leakage between tenants
|
||||||
|
- Retrieval without citations causes trust and compliance issues
|
||||||
|
- Embeddings without source lifecycle management become stale quickly
|
||||||
122
docs/roadmaps/not-started/platform_ORG_WORKSPACE_RBAC_ROADMAP.md
Normal file
@ -0,0 +1,122 @@
|
|||||||
|
# Org, Workspace & RBAC Roadmap
|
||||||
|
|
||||||
|
> **Purpose:** Add a first-class tenant model for organizations, workspaces,
|
||||||
|
> teams, memberships, scoped roles, and admin governance.
|
||||||
|
>
|
||||||
|
> **Primary Surface:** `services/platform-service/`
|
||||||
|
>
|
||||||
|
> **Status:** Planned
|
||||||
|
>
|
||||||
|
> **Estimated Effort:** 3-4 weeks
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Why This Is Missing
|
||||||
|
|
||||||
|
The platform has users and per-product memberships, but no canonical model for:
|
||||||
|
|
||||||
|
- organizations
|
||||||
|
- workspaces
|
||||||
|
- teams
|
||||||
|
- workspace-scoped roles
|
||||||
|
- resource ownership and sharing
|
||||||
|
|
||||||
|
Enterprise IdP support exists, but it does not replace a real tenant model.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Recommended Stack
|
||||||
|
|
||||||
|
### Best Long-Term Industry Standard
|
||||||
|
|
||||||
|
- **PostgreSQL**
|
||||||
|
- **Drizzle ORM** or **Prisma**
|
||||||
|
- **OpenFGA** or Zanzibar-style authorization model for fine-grained access
|
||||||
|
|
||||||
|
### Best Repo-Fit Option
|
||||||
|
|
||||||
|
- `platform-service` module set backed by Cosmos
|
||||||
|
- Role and membership evaluation in service code
|
||||||
|
- Optional policy layer later using OpenFGA
|
||||||
|
|
||||||
|
### Recommendation
|
||||||
|
|
||||||
|
If tenancy will be central to the business, PostgreSQL is the better long-term
|
||||||
|
fit because org/workspace membership is relational by nature. If short-term
|
||||||
|
consistency with the existing Cosmos-backed stack matters more, start in Cosmos but keep the permission model portable.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 1 - Data Model
|
||||||
|
|
||||||
|
- [ ] Create modules:
|
||||||
|
- [ ] `orgs`
|
||||||
|
- [ ] `workspaces`
|
||||||
|
- [ ] `teams`
|
||||||
|
- [ ] `memberships`
|
||||||
|
- [ ] `roles`
|
||||||
|
- [ ] Define resources:
|
||||||
|
- [ ] organization
|
||||||
|
- [ ] workspace
|
||||||
|
- [ ] team
|
||||||
|
- [ ] service account
|
||||||
|
- [ ] API key
|
||||||
|
- [ ] Define roles:
|
||||||
|
- [ ] `org_owner`
|
||||||
|
- [ ] `org_admin`
|
||||||
|
- [ ] `workspace_admin`
|
||||||
|
- [ ] `workspace_editor`
|
||||||
|
- [ ] `workspace_viewer`
|
||||||
|
- [ ] `support_operator`
|
||||||
|
- [ ] Add invitation and deprovision flows
|
||||||
|
|
||||||
|
**Acceptance Criteria**
|
||||||
|
|
||||||
|
- Every protected resource can be tied to org/workspace ownership
|
||||||
|
- Users can belong to multiple workspaces with different roles
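
The second acceptance criterion (one user, several workspaces, different roles) suggests a membership record keyed by user and scope. The sketch below is one possible shape. The role names come from the checklist, while the `Membership` interface, the convention that `workspaceId: null` means an org-level grant, and the `rolesFor` helper are assumptions.

```typescript
type Role =
  | "org_owner"
  | "org_admin"
  | "workspace_admin"
  | "workspace_editor"
  | "workspace_viewer"
  | "support_operator";

// Hypothetical membership record; workspaceId null = org-wide grant (assumed).
interface Membership {
  userId: string;
  orgId: string;
  workspaceId: string | null;
  role: Role;
}

// Collects every role a user holds for a workspace, including
// org-level roles that apply to all workspaces in the org.
function rolesFor(
  memberships: Membership[],
  userId: string,
  orgId: string,
  workspaceId: string,
): Role[] {
  return memberships
    .filter(
      (m) =>
        m.userId === userId &&
        m.orgId === orgId &&
        (m.workspaceId === null || m.workspaceId === workspaceId),
    )
    .map((m) => m.role);
}
```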
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 2 - Authorization
|
||||||
|
|
||||||
|
- [ ] Add authorization helpers to `@bytelyst/auth` or a new `@bytelyst/authorization`
|
||||||
|
- [ ] Evaluate permissions by resource and action
|
||||||
|
- [ ] Add policy checks for:
|
||||||
|
- [ ] read
|
||||||
|
- [ ] write
|
||||||
|
- [ ] execute
|
||||||
|
- [ ] approve
|
||||||
|
- [ ] administer
|
||||||
|
- [ ] Add service account and API key scopes
|
||||||
|
|
||||||
|
**Acceptance Criteria**
|
||||||
|
|
||||||
|
- Endpoints no longer rely only on flat `admin` vs `user`
|
||||||
|
- Policies are testable and reusable across modules
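
Evaluating permissions by role and action, as Phase 2 describes, can start as a plain lookup table. The sketch below is an illustration under stated assumptions: the role and action names come from the roadmap, but the role-to-action mapping in `rolePolicy` is invented here and is not a proposed policy. The `can` helper is a hypothetical name, not an existing `@bytelyst/auth` export.

```typescript
type Role =
  | "org_owner"
  | "org_admin"
  | "workspace_admin"
  | "workspace_editor"
  | "workspace_viewer"
  | "support_operator";

type Action = "read" | "write" | "execute" | "approve" | "administer";

// Illustrative mapping only; the real grants are a product decision.
const rolePolicy: Record<Role, Action[]> = {
  org_owner: ["read", "write", "execute", "approve", "administer"],
  org_admin: ["read", "write", "execute", "approve", "administer"],
  workspace_admin: ["read", "write", "execute", "approve"],
  workspace_editor: ["read", "write", "execute"],
  workspace_viewer: ["read"],
  support_operator: ["read"],
};

// A user may perform an action if any of their roles grants it.
function can(roles: Role[], action: Action): boolean {
  return roles.some((role) => rolePolicy[role].includes(action));
}
```

Because the policy is plain data and `can` is a pure function, both are straightforward to unit test and to reuse across modules, which is what the acceptance criteria ask for.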
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 3 - Product Integration
|
||||||
|
|
||||||
|
- [ ] Migrate existing modules that should be workspace-scoped
|
||||||
|
- [ ] Add workspace headers or explicit route scoping
|
||||||
|
- [ ] Connect enterprise IdP claims to org/workspace resolution
|
||||||
|
- [ ] Add audit entries for membership and role changes
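
For the "workspace headers or explicit route scoping" item, a request could resolve its workspace as in the sketch below. The header name `x-workspace-id`, the `WorkspaceScope` result type, and the rule that a route parameter and a header must agree when both are present are all assumptions, not existing platform behavior.

```typescript
// Result of scope resolution: either a workspace ID or a rejection reason.
type WorkspaceScope =
  | { ok: true; workspaceId: string }
  | { ok: false; reason: string };

// Prefers the explicit route parameter, falls back to the (hypothetical)
// x-workspace-id header, and rejects requests where the two disagree.
function resolveWorkspaceScope(
  routeWorkspaceId: string | undefined,
  headers: Record<string, string | undefined>,
): WorkspaceScope {
  const headerValue = headers["x-workspace-id"];
  if (routeWorkspaceId && headerValue && routeWorkspaceId !== headerValue) {
    return { ok: false, reason: "route and header workspace IDs differ" };
  }
  const workspaceId = routeWorkspaceId || headerValue;
  if (!workspaceId) {
    return { ok: false, reason: "missing workspace scope" };
  }
  return { ok: true, workspaceId };
}
```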
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Tech Stack Options
|
||||||
|
|
||||||
|
| Option                      | Pros                                | Cons                                  | Fit                    |
| --------------------------- | ----------------------------------- | ------------------------------------- | ---------------------- |
| PostgreSQL + OpenFGA        | Best long-term for RBAC and sharing | New datastore + auth layer            | Best industry-standard |
| PostgreSQL only             | Simpler than OpenFGA, still strong  | Fine-grained auth needs custom code   | Good medium path       |
| Cosmos + service-level RBAC | Lowest disruption                   | Harder joins, less expressive policy  | Good short-term        |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Risks
|
||||||
|
|
||||||
|
- Flat roles will become a blocker for enterprise and multi-agent collaboration
|
||||||
|
- Delaying workspace boundaries causes later data migrations
|
||||||
|
- Fine-grained sharing is hard to retrofit once data models hardcode user ownership
|
||||||
@ -0,0 +1,87 @@
|
|||||||
|
# Support Case Management Roadmap
|
||||||
|
|
||||||
|
> **Purpose:** Build a platform-native case system for customer issues, agent
|
||||||
|
> escalations, internal triage, and resolution tracking.
|
||||||
|
>
|
||||||
|
> **Primary Surfaces:** `services/platform-service/`, `services/mcp-server/`
|
||||||
|
>
|
||||||
|
> **Status:** Planned
|
||||||
|
>
|
||||||
|
> **Estimated Effort:** 3-4 weeks
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Why This Is Missing
|
||||||
|
|
||||||
|
The repo has diagnostics, telemetry, debug tooling, and support-oriented MCP helpers.
|
||||||
|
What it lacks is a canonical case record that ties them together.
|
||||||
|
|
||||||
|
Without a case system, support work becomes fragmented across logs, chat, and ad hoc notes.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Recommended Stack
|
||||||
|
|
||||||
|
- `platform-service` case module
|
||||||
|
- Cosmos or PostgreSQL for case records
|
||||||
|
- Blob storage for attachments and debug packs
|
||||||
|
- Notification hooks to Slack/Telegram/email
|
||||||
|
|
||||||
|
### Recommendation
|
||||||
|
|
||||||
|
This can live comfortably in `platform-service`. If the case domain becomes highly
|
||||||
|
relational, PostgreSQL is better. Otherwise a Cosmos-backed module is acceptable.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 1 - Core Case Model
|
||||||
|
|
||||||
|
- [ ] Create modules:
|
||||||
|
- [ ] `cases`
|
||||||
|
- [ ] `case-comments`
|
||||||
|
- [ ] `case-attachments`
|
||||||
|
- [ ] `case-links`
|
||||||
|
- [ ] Track:
|
||||||
|
- [ ] customer or workspace
|
||||||
|
- [ ] severity
|
||||||
|
- [ ] status
|
||||||
|
- [ ] assignee
|
||||||
|
- [ ] linked run IDs
|
||||||
|
- [ ] linked diagnostics sessions
|
||||||
|
- [ ] linked incidents and releases
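
The tracked fields above could map onto a case record like the one sketched here. The field list follows the checklist. The `SupportCase` interface, the `Severity` and `CaseStatus` values, and the `linkRun` helper are hypothetical and only illustrate how links to runs might be kept free of duplicates.

```typescript
type Severity = "low" | "medium" | "high" | "critical"; // assumed scale
type CaseStatus = "open" | "triaged" | "in_progress" | "resolved" | "closed"; // assumed

// Hypothetical canonical case record tying runs, diagnostics,
// incidents, and releases together.
interface SupportCase {
  id: string;
  workspaceId: string;
  severity: Severity;
  status: CaseStatus;
  assigneeId: string | null;
  linkedRunIds: string[];
  linkedDiagnosticsSessionIds: string[];
  linkedIncidentIds: string[];
  linkedReleaseIds: string[];
}

// Links a run to a case without mutating the input;
// linking the same run twice is a no-op.
function linkRun(supportCase: SupportCase, runId: string): SupportCase {
  if (supportCase.linkedRunIds.includes(runId)) {
    return supportCase;
  }
  return { ...supportCase, linkedRunIds: [...supportCase.linkedRunIds, runId] };
}
```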
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 2 - Operational Workflow
|
||||||
|
|
||||||
|
- [ ] Add triage statuses and SLA timers
|
||||||
|
- [ ] Add handoff between support, engineering, and operations
|
||||||
|
- [ ] Add debug-pack ingestion
|
||||||
|
- [ ] Add incident and case cross-links
|
||||||
|
- [ ] Add case templates for common issue categories
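
SLA timers from Phase 2 reduce to a due time derived from severity plus a breach check. The sketch below assumes a four-level severity scale and response windows of 1, 4, 24, and 72 hours. Those numbers are placeholders, not proposed targets, and `slaDueAt` and `isSlaBreached` are hypothetical helper names.

```typescript
type Severity = "low" | "medium" | "high" | "critical"; // assumed scale

// Placeholder response windows in hours; real targets are a policy decision.
const responseWindowHours: Record<Severity, number> = {
  critical: 1,
  high: 4,
  medium: 24,
  low: 72,
};

const MS_PER_HOUR = 60 * 60 * 1000;

// Computes when a case opened at `openedAt` must receive a response.
function slaDueAt(openedAt: Date, severity: Severity): Date {
  return new Date(openedAt.getTime() + responseWindowHours[severity] * MS_PER_HOUR);
}

// A case is in breach once the current time is past its due time.
function isSlaBreached(dueAt: Date, now: Date): boolean {
  return now.getTime() > dueAt.getTime();
}
```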
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 3 - Agent Integration
|
||||||
|
|
||||||
|
- [ ] Let agents create draft cases from failed or escalated runs
|
||||||
|
- [ ] Let support operators ask MCP tools for case-linked diagnostics
|
||||||
|
- [ ] Add case summarization and next-step suggestions
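
Letting agents create draft cases from failed or escalated runs could look like the following sketch. `RunOutcome`, `DraftCase`, and `draftCaseFromRun` are hypothetical names, and the `draft` status is an assumption about how agent-created cases would be kept apart from operator-confirmed ones.

```typescript
// Hypothetical summary of a finished run, as seen by the case module.
interface RunOutcome {
  runId: string;
  workspaceId: string;
  status: "completed" | "failed" | "escalated";
  errorMessage?: string;
}

// Hypothetical draft case awaiting confirmation by a support operator.
interface DraftCase {
  title: string;
  workspaceId: string;
  status: "draft";
  linkedRunIds: string[];
  description: string;
}

// Produces a draft case for failed or escalated runs and
// nothing for runs that completed normally.
function draftCaseFromRun(run: RunOutcome): DraftCase | null {
  if (run.status === "completed") {
    return null;
  }
  return {
    title: `Run ${run.runId} ${run.status}`,
    workspaceId: run.workspaceId,
    status: "draft",
    linkedRunIds: [run.runId],
    description: run.errorMessage ?? "No error message recorded.",
  };
}
```

Linking the originating run at creation time preserves the causal context that the Risks section warns about losing.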
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Tech Stack Options
|
||||||
|
|
||||||
|
| Option                      | Pros                          | Cons                            | Fit                   |
| --------------------------- | ----------------------------- | ------------------------------- | --------------------- |
| Platform-native case module | Full control, integrates well | More work up front              | Best long-term        |
| External helpdesk sync      | Faster bootstrap              | Split system of record          | Good only if required |
| Ticket tool only            | Lowest build effort           | Weak agent-platform integration | Poor strategic fit    |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Risks
|
||||||
|
|
||||||
|
- No unified case object means poor support analytics and weak escalations
|
||||||
|
- External-only support systems hide key agent and diagnostics context
|
||||||
|
- If cases cannot link to runs and review queues, operators lose causal context
|
||||||