docs(roadmaps): detailed 2026-04-17 plans for event bus, RAG, agent runtime, approval queue
Four grounded roadmaps superseding the scaffolded versions with current-state audits, data models, week-by-week phase plans, tech-stack decisions, and acceptance checklists. Execution order: Event Bus (P0) unlocks Agent Runtime + RAG (P1 parallel), which unlock Approval Queue (P1).
This commit is contained in:
parent
7d266bfcc0
commit
46a16f06bc
@ -0,0 +1,183 @@
|
||||
# Agent Runtime & Orchestration — Detailed Roadmap (2026-04-17)
|
||||
|
||||
> **Supersedes:** `platform_AGENT_RUNTIME_ORCHESTRATION_ROADMAP.md` (scaffold).
|
||||
> **Status:** Planned — enables platform-hosted multi-step agent runs reusable across products.
|
||||
> **Primary surfaces:** `services/platform-service/src/modules/{agent-runtime,agents,runs}/`, `services/mcp-server/src/modules/a2a/`, `services/cowork-service/`.
|
||||
> **Estimated effort:** 3–5 weeks (1 engineer) / 2–3 weeks (2 engineers).
|
||||
> **Priority:** **P1** — closes the gap between cowork (desktop agent) and platform-hosted agents invoked by product backends or admins.
|
||||
|
||||
---
|
||||
|
||||
## 1. Current State (as of 2026-04-17)
|
||||
|
||||
| Surface | State | Evidence |
|
||||
| ---------------------------------------- | -------------------------------------------------------------------------------------------------- | ---------------------------------------------------- |
|
||||
| `platform-service/modules/agent-runtime` | Routes only; no persistent state model | `routes.ts`, `routes.test.ts` |
|
||||
| `platform-service/modules/agents` | Fuller: `repository`, `routes`, `executor`, `tool-registry`, `types` — but in-memory execution | 8 files in `agents/` |
|
||||
| `platform-service/modules/runs` | Run tracker (`tracker.ts`), repository, routes — already models run lifecycle but not queue-backed | 7 files |
|
||||
| `platform-service/modules/ai-budgets` | Exists — enforces spend per product | `ai-budgets/` |
|
||||
| `platform-service/modules/agent-evals` | Exists — eval harness skeleton | `agent-evals/` |
|
||||
| `mcp-server/modules/a2a` | A2A pipelines as composed code paths; no durable step state | `services/mcp-server/src/modules/a2a/` |
|
||||
| `cowork-service/modules/agent-runtime` | IPC bridge to cowork-orchestrator `submit_task` | `services/cowork-service/src/modules/agent-runtime/` |
|
||||
| `@bytelyst/queue` | `QueueWorker` + memory/file stores. No cross-process yet | `packages/queue/` |
|
||||
|
||||
**Net:** the run data model is partially there; what's missing is the **durable queue-backed execution** plus **unified orchestration** so cowork, product-backend agents, and admin-initiated runs share one run API.
|
||||
|
||||
---
|
||||
|
||||
## 2. Success Criteria (V1)
|
||||
|
||||
1. An `POST /api/agent-runs` request creates a queued run; executor worker picks it up and progresses through steps with full persistence.
|
||||
2. Kill the executor mid-step → run resumes from the last completed step on restart without duplicated side effects (idempotency on step outputs).
|
||||
3. A run can enter `waiting_for_review` (see Approval Queue roadmap) and resume when a reviewer decides.
|
||||
4. Cowork sessions created via `ipc-bridge` register as platform agent runs and appear in `/ops/runs`.
|
||||
5. Each run carries `aiBudgetId` and writes per-step token usage; over-budget runs pause, not crash.
|
||||
6. 300+ concurrent runs sustained with per-run latency < 2× single-run baseline.
|
||||
|
||||
---
|
||||
|
||||
## 3. Non-Goals (V1)
|
||||
|
||||
- Full Temporal adoption (contract is Temporal-compatible for future migration; no Temporal infra yet)
|
||||
- User-authored workflow DSL
|
||||
- Multi-region run migration
|
||||
- Streaming step outputs to clients (add post-V1 via SSE + existing `@bytelyst/fastify-sse`)
|
||||
|
||||
---
|
||||
|
||||
## 4. Dependencies
|
||||
|
||||
- **Hard:** Durable Event Bus & Worker Runtime (C1)
|
||||
- **Hard:** `@bytelyst/queue` Redis adapter (Phase 2 of C1)
|
||||
- **Soft:** Human Review Approval Queue (C4) for `waiting_for_review` semantics
|
||||
- **Soft:** Agent Registry & Prompt Versioning (existing scaffolded doc) — for deterministic `agentVersion` binding
|
||||
|
||||
---
|
||||
|
||||
## 5. Canonical Run Model
|
||||
|
||||
New/expanded Cosmos containers (PK in **bold**):
|
||||
|
||||
| Container | PK | Purpose |
|
||||
| ------------------ | ------------- | ---------------------------------------------------------------------------------------------------------- |
|
||||
| `agent-runs` | **productId** | `AgentRunDoc`: state, workflowId, agentId, agentVersion, triggerSource, parentRunId, aiBudgetId, createdBy |
|
||||
| `agent-run-steps` | **runId** | Step inputs/outputs/timings/errors; idempotency key; `state` |
|
||||
| `agent-run-events` | **runId** | Durable event log (mirror into event store) for replay/operator view |
|
||||
| `agent-run-leases` | **runId** | Worker lease with expiry — enables safe failover |
|
||||
|
||||
Run states: `queued → running → waiting_for_input | waiting_for_review | paused → running → (succeeded | failed | cancelled)`.
|
||||
|
||||
Step states: `pending → running → (succeeded | failed | retrying | skipped)`.
|
||||
|
||||
Every doc includes `productId` + `workspaceId` + `correlationId`.
|
||||
|
||||
---
|
||||
|
||||
## 6. Phase Plan
|
||||
|
||||
### Phase 1 — Durable Run Model (week 1)
|
||||
|
||||
- [ ] **1.1** Extend `agent-runtime/types.ts` with `AgentRunDoc`, `AgentRunStepDoc`, `AgentRunEventDoc`, `AgentRunLeaseDoc` (zod schemas)
|
||||
- [ ] **1.2** Repository methods: `createRun`, `getRun`, `listRuns`, `appendStep`, `updateStepState`, `acquireLease`, `renewLease`, `releaseLease`
|
||||
- [ ] **1.3** Migrate existing `runs/` module → rename to legacy alias; new agent-runs module becomes source of truth
|
||||
- [ ] **1.4** REST surface:
|
||||
- [ ] `POST /api/agent-runs` → creates run, publishes `agent.run.execute` event
|
||||
- [ ] `GET /api/agent-runs/:id`
|
||||
- [ ] `GET /api/agent-runs/:id/events` (paginated + optional SSE)
|
||||
- [ ] `POST /api/agent-runs/:id/cancel` → sets `cancelRequested` flag read by executor
|
||||
- [ ] `POST /api/agent-runs/:id/pause` / `/resume`
|
||||
- [ ] `POST /api/agent-runs/:id/signal` — named signal (for approval, input, tool-result)
|
||||
- [ ] **1.5** 25+ route tests using `@bytelyst/testing` fastify inject helpers
|
||||
|
||||
**Exit:** run lifecycle transitions persist across fastify restart; all state queryable.
|
||||
|
||||
---
|
||||
|
||||
### Phase 2 — Queue-Backed Executor (week 2)
|
||||
|
||||
- [ ] **2.1** Add `agent.run.execute` queue type on top of Durable Event Bus + `@bytelyst/queue`
|
||||
- [ ] **2.2** `AgentRunExecutor` class in `modules/agent-runtime/executor.ts`:
|
||||
- Acquires lease (30s heartbeat); holds until step completes
|
||||
- For each step: resolves agent spec → calls tool (`agents/tool-registry.ts`) → records output with idempotency key
|
||||
- On failure: exponential backoff, max retries per step from `agentVersion.retryPolicy`
|
||||
- DLQ on exhausted retries; run → `failed`
|
||||
- [ ] **2.3** Idempotency:
|
||||
- Caller can pass `idempotencyKey` on `POST /agent-runs`; repeat within 24h returns existing run
|
||||
- Each step output stored by `{runId, stepIndex, attempt}`; external tool calls opt-in to idempotency via handler contract
|
||||
- [ ] **2.4** Budget integration: pre-step check against `ai-budgets`; over-budget → run → `waiting_for_input` with reason `budget_exceeded`
|
||||
- [ ] **2.5** Cancellation: `cancelRequested` polled between steps; in-flight step finishes cleanly, run → `cancelled`
|
||||
|
||||
**Exit:** chaos test — kill executor process during a 5-step run. Run resumes from last checkpoint. Idempotent tool not double-invoked.
|
||||
|
||||
---
|
||||
|
||||
### Phase 3 — A2A + Cowork Integration (week 3)
|
||||
|
||||
- [ ] **3.1** `mcp-server/modules/a2a` refactor: pipeline steps become agent-run steps. Pipeline definition → stored as workflow template, not code path.
|
||||
- [ ] **3.2** `cowork-service/modules/agent-runtime` updates: when cowork submits a task via IPC, also register as platform agent run (dual-write with `triggerSource=cowork`)
|
||||
- [ ] **3.3** Cross-system correlation: cowork's `taskId` stored as external ref on agent-run; admin `/ops/runs` surfaces both sides
|
||||
- [ ] **3.4** Agent Registry integration (depends on separately scaffolded roadmap): `agentId + agentVersion` required on run creation; registry prompt snapshot frozen at run start
|
||||
|
||||
**Exit:** every cowork session visible under `/ops/runs`; every A2A pipeline surfaces step-by-step.
|
||||
|
||||
---
|
||||
|
||||
### Phase 4 — Operator Experience (week 4)
|
||||
|
||||
- [ ] **4.1** Admin UI `dashboards/admin-web/app/ops/runs/`:
|
||||
- List with filters (product, workspace, agent, state, age, reviewer)
|
||||
- Timeline view per run
|
||||
- Step diff view (prompt/tool-call/output)
|
||||
- Cancel / retry / replay controls
|
||||
- [ ] **4.2** SLO dashboard: success rate, mean duration, retry rate, DLQ count per agent
|
||||
- [ ] **4.3** `@bytelyst/platform-client` additions: `client.agentRuns.create/get/cancel/signal` for product backends
|
||||
- [ ] **4.4** Migration guide: product backends that roll their own agent loops → 1-call swap
|
||||
|
||||
**Exit:** one product backend (e.g., JarvisJr) dispatches an agent via platform runs instead of in-proc call.
|
||||
|
||||
---
|
||||
|
||||
### Phase 5 — Scale & Hardening (week 5)
|
||||
|
||||
- [ ] **5.1** Horizontal scale: 2+ executor replicas via lease-based safety
|
||||
- [ ] **5.2** Load test: 300 concurrent runs, 5 steps each, measure P50/P95/P99
|
||||
- [ ] **5.3** Backpressure: when queue depth > N, new runs enter `queued` but `POST` returns immediately; admin can see queue saturation
|
||||
- [ ] **5.4** Circuit breakers: per-agent failure-rate trips new run acceptance for 60s
|
||||
- [ ] **5.5** Observability: OpenTelemetry spans around every step; trace ID propagates to external tool calls
|
||||
|
||||
**Exit:** load test meets latency budget; operator dashboard usable under load.
|
||||
|
||||
---
|
||||
|
||||
## 7. Tech Stack Decision
|
||||
|
||||
| Option | V1? | Rationale |
|
||||
| ------------------------------------------------------- | ----------------------------- | ------------------------------------------------- |
|
||||
| **`@bytelyst/queue` + Redis adapter + Cosmos run docs** | ✅ Yes | Reuses Durable Event Bus; no new major infra |
|
||||
| Temporal | Deferred | Right answer for V2; too much infra for V1 |
|
||||
| BullMQ direct | Replaced by `@bytelyst/queue` | Same class; we'd just be skipping our abstraction |
|
||||
| SQS / EventBridge | Skip | Cloud-vendor coupling |
|
||||
|
||||
---
|
||||
|
||||
## 8. Risks & Mitigations
|
||||
|
||||
| Risk | Mitigation |
|
||||
| --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| "Weak in-house Temporal" drift | Contract borrows Temporal terminology (workflowId, runId, signals, child runs) so future migration is mechanical |
|
||||
| Step idempotency misuse | Executor requires handlers declare `idempotent: true`; non-idempotent defaults to single-attempt + failover-safe retry only on transient errors |
|
||||
| Lease contention at scale | Leases include random jitter; stuck runs auto-released after 3× heartbeat period |
|
||||
| Product teams bypass to run local loops | Deprecation + budget-module enforcement: unregistered runs get 0 budget |
|
||||
|
||||
---
|
||||
|
||||
## 9. Acceptance Checklist
|
||||
|
||||
- [ ] Phase 1 — run model persistent; REST complete
|
||||
- [ ] Phase 2 — executor survives restart; idempotency proven
|
||||
- [ ] Phase 3 — A2A + cowork both integrated
|
||||
- [ ] Phase 4 — admin UI shipped; one product backend migrated
|
||||
- [ ] Phase 5 — 300-concurrent load test passes
|
||||
- [ ] `platform-client` published with new methods
|
||||
- [ ] `docs/ECOSYSTEM_ARCHITECTURE.md` + AGENTS.md updated
|
||||
- [ ] Roadmap moved to `docs/roadmaps/completed/`
|
||||
@ -0,0 +1,143 @@
|
||||
# Durable Event Bus & Worker Runtime — Detailed Roadmap (2026-04-17)
|
||||
|
||||
> **Supersedes:** `platform_DURABLE_EVENT_BUS_AND_WORKER_RUNTIME_ROADMAP.md` (scaffold).
|
||||
> **Status:** Planned — unblocks Agent Runtime, Approval Queue, Knowledge ingestion.
|
||||
> **Primary surfaces:** `packages/events/`, `packages/queue/`, `packages/event-store/`, `services/platform-service/`.
|
||||
> **Estimated effort:** 3–4 weeks (1 engineer) / 2 weeks (2 engineers).
|
||||
> **Priority:** **P0** — foundation for Categories C3/C4 and for reliable cross-service side effects.
|
||||
|
||||
---
|
||||
|
||||
## 1. Current State (as of 2026-04-17)
|
||||
|
||||
| Surface | State | Evidence |
|
||||
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------ |
|
||||
| `@bytelyst/events` | `EventBus` (memory) + `DurableEventBus` interface + typed schemas (AgentRun, AgentTask, AgentDispatch, AgentCheckpoint, Artifact, …) | `packages/events/src/index.ts` |
|
||||
| `@bytelyst/event-store` | Cosmos-backed append-only event store (exists, underused) | `packages/event-store/src/` |
|
||||
| `@bytelyst/queue` | `QueueWorker` + `MemoryQueueStore` + `FileQueueStore`. No Redis, no cross-process delivery | `packages/queue/src/index.ts` |
|
||||
| Platform subscribers | `delivery`, `notifications`, `audit`, `runs/tracker.ts` emit on the in-proc memory bus; cross-service = ad-hoc HTTP | `services/platform-service/src/modules/*/` |
|
||||
| Worker processes | None standardized. Scheduled jobs embed inside fastify via setInterval / per-module timers | n/a |
|
||||
|
||||
**Net:** we have durable-queue and durable-event-store primitives but no **standardized durable bus or worker runtime** consuming them. Production flows still ride the in-memory bus.
|
||||
|
||||
---
|
||||
|
||||
## 2. Success Criteria (V1)
|
||||
|
||||
1. One event emitted in `platform-service` is reliably consumed by `cowork-service`, `extraction-service`, or `mcp-server` within 1s P95, surviving restart on either side.
|
||||
2. Every durable subscriber records ack / retry / DLQ counts; operators can query/replay from Cosmos.
|
||||
3. A single `workers/` bootstrap pattern is documented and used by ≥3 side-effect consumers.
|
||||
4. No new module added after Phase 3 relies on the in-memory bus for production-critical flows.
|
||||
5. `@bytelyst/events` memory adapter remains the default in **test environments only** (`NODE_ENV=test`).
|
||||
|
||||
---
|
||||
|
||||
## 3. Non-Goals (V1)
|
||||
|
||||
- Kafka-scale throughput, exactly-once delivery, cross-region replication
|
||||
- Full Temporal semantics (that's the Agent Runtime roadmap)
|
||||
- Replacing the HTTP REST layer between services
|
||||
|
||||
---
|
||||
|
||||
## 4. Dependencies
|
||||
|
||||
- None (this is the foundation). Enables: Agent Runtime Orchestration, Human Review Approval Queue, Knowledge RAG ingestion.
|
||||
|
||||
---
|
||||
|
||||
## 5. Phase Plan
|
||||
|
||||
### Phase 1 — Pluggable Event Bus Backends (week 1)
|
||||
|
||||
- [ ] **1.1** Formalize `DurableEventAdapter` contract in `@bytelyst/events` (publish, subscribe, ack, nack, dlq, replay)
|
||||
- [ ] **1.2** Keep `MemoryAdapter` as the default for tests
|
||||
- [ ] **1.3** Implement `CosmosEventStoreAdapter` on top of existing `@bytelyst/event-store` (append + tail via change-feed)
|
||||
- [ ] Partition key: `productId` + day-bucket
|
||||
- [ ] Consumer groups via lease-collection (reuse `packages/cosmos/src/lease-manager` if present; else add lightweight doc-based lease)
|
||||
- [ ] **1.4** Add `correlationId` + `causationId` to every `PlatformEventSchemas` entry (schema-breaking; ship behind zod `.catchall()` fallback for unknown fields)
|
||||
- [ ] **1.5** Add `packages/events/src/durable.test.ts` — adapter conformance suite (publish/subscribe/ack/retry/DLQ)
|
||||
|
||||
**Exit:** `pnpm --filter @bytelyst/events test` covers the Cosmos adapter using the Cosmos emulator.
|
||||
|
||||
---
|
||||
|
||||
### Phase 2 — Standardized Worker Runtime (week 2)
|
||||
|
||||
- [ ] **2.1** New surface: `services/platform-service/src/workers/` with a single `createWorker(opts)` bootstrap
|
||||
- [ ] Concurrency control (default: `AGENT_WORKER_CONCURRENCY=4`)
|
||||
- [ ] Lease heartbeat (default 30s, configurable per handler)
|
||||
- [ ] Graceful shutdown on SIGINT/SIGTERM
|
||||
- [ ] Pino child logger with `workerName`, `eventId`, `correlationId` bindings
|
||||
- [ ] **2.2** Expose `/ops/workers/*` admin routes: list workers, inspect in-flight leases, force-replay DLQ item, pause/resume
|
||||
- [ ] **2.3** Health: each worker reports into `GET /api/health/dependencies` under a `workers` key
|
||||
- [ ] **2.4** Operator command: `pnpm --filter @lysnrai/platform-service worker:run <name>` for running a single worker out-of-process (k8s-ready)
|
||||
- [ ] **2.5** Dead-letter inspection: Cosmos container `events-dlq` with 30-day TTL; admin UI can re-enqueue
|
||||
|
||||
**Exit:** `reviews` module's notification side-effects migrated to durable worker. Kill the Fastify process mid-emit; event still delivered after restart.
|
||||
|
||||
---
|
||||
|
||||
### Phase 3 — Migrate Production Side Effects (week 3)
|
||||
|
||||
- [ ] **3.1** Audit in-memory `EventBus.emit()` call sites (expect ≤20) and categorize:
|
||||
- Critical (must durably deliver): auth-events, run-state-changes, billing-events, approval-events
|
||||
- Best-effort (OK to lose): dev-time dashboards, local UI refresh hints
|
||||
- [ ] **3.2** Migrate critical callers to `DurableEventBus` one module at a time:
|
||||
- [ ] `auth` side-effects (login audit, MFA trigger)
|
||||
- [ ] `runs/tracker.ts` state transitions
|
||||
- [ ] `billing-checkout` / `dunning` / `subscriptions` webhooks
|
||||
- [ ] `notifications` fan-out
|
||||
- [ ] `delivery` outbox
|
||||
- [ ] **3.3** Replace per-module `setInterval` schedulers with registered durable workers (e.g., `retention/`, `exports/`, `predictive-analytics/`)
|
||||
- [ ] **3.4** Telemetry: emit `event_published_total`, `event_consumed_total`, `event_dlq_total`, `worker_lease_expired_total` via `@bytelyst/backend-telemetry`
|
||||
- [ ] **3.5** Grafana dashboard: event-lag P95 per subscriber, DLQ size, worker concurrency saturation
|
||||
|
||||
**Exit:** post-migration, no production code path emits to `new EventBus()` in `NODE_ENV=production`. Enforced by an ESLint rule in `eslint.config.js` (`no-restricted-imports` scoped to `@bytelyst/events/memory`).
|
||||
|
||||
---
|
||||
|
||||
### Phase 4 — Cross-Service Subscriptions (week 3–4)
|
||||
|
||||
- [ ] **4.1** `cowork-service` subscribes to `agent.run.*` events and mirrors state into its own tracker (currently polls `platform-service/runs`)
|
||||
- [ ] **4.2** `extraction-service` subscribes to `knowledge.ingest.requested` (see RAG roadmap) instead of exposing REST
|
||||
- [ ] **4.3** `mcp-server` emits `a2a.pipeline.step.*` events that become the audit trail source of truth
|
||||
- [ ] **4.4** Compose: add a Redis container to `docker-compose.yml` as an **optional** fast-path adapter (later phase)
|
||||
|
||||
**Exit:** `pnpm prototype:self-test` exercises one end-to-end cross-service event.
|
||||
|
||||
---
|
||||
|
||||
## 6. Tech Stack Decision
|
||||
|
||||
| Option | V1 default? | Rationale |
|
||||
| ------------------------------ | ------------------------ | ------------------------------------------------------------------------ |
|
||||
| **Cosmos change-feed adapter** | ✅ Yes | Reuses existing emulator, persistence story, Azure runtime; no new infra |
|
||||
| Redis Streams | Later (Phase 4 optional) | Lower latency; add after Cosmos proves semantics |
|
||||
| NATS JetStream | Deferred | Best long-term; reassess at Phase 5+ |
|
||||
| Kafka | Never (overkill) | — |
|
||||
|
||||
**Default decision:** Cosmos-backed adapter first → add Redis optional adapter at Phase 4 → revisit NATS once event volume > 10 events/sec sustained.
|
||||
|
||||
---
|
||||
|
||||
## 7. Risks & Mitigations
|
||||
|
||||
| Risk | Mitigation |
|
||||
| ---------------------------------- | ---------------------------------------------------------------------------------------------------------- |
|
||||
| Cosmos change-feed lag under burst | Per-consumer lease + configurable `maxItemCount`; alert on lag > 5s |
|
||||
| Schema drift breaks replay | Every event carries `schemaVersion`; adapters refuse unknown major versions; replay path calls migrator |
|
||||
| Workers multiply process count | `services/platform-service` runs workers in-process by default; only split out when concurrency demands it |
|
||||
| In-memory bus regressions | Lint rule + CI grep blocks `new EventBus()` outside `*/test/*` paths |
|
||||
|
||||
---
|
||||
|
||||
## 8. Acceptance Checklist
|
||||
|
||||
- [ ] Phase 1 complete — adapter conformance suite passes on memory + Cosmos
|
||||
- [ ] Phase 2 complete — `/ops/workers` + DLQ admin routes shipped
|
||||
- [ ] Phase 3 complete — ≥5 modules migrated; ESLint guard active
|
||||
- [ ] Phase 4 complete — one cross-service flow demonstrated end-to-end
|
||||
- [ ] Grafana dashboard `platform-events` live
|
||||
- [ ] AGENTS.md + `docs/ECOSYSTEM_ARCHITECTURE.md` updated
|
||||
- [ ] Roadmap moved to `docs/roadmaps/completed/`
|
||||
@ -0,0 +1,173 @@
|
||||
# Human Review & Approval Queue — Detailed Roadmap (2026-04-17)
|
||||
|
||||
> **Supersedes:** `platform_HUMAN_REVIEW_APPROVAL_QUEUE_ROADMAP.md` (scaffold).
|
||||
> **Status:** Planned — critical safety rail for autonomous agents.
|
||||
> **Primary surfaces:** `services/platform-service/src/modules/reviews/`, new `modules/approvals/`, new `modules/escalations/`, `dashboards/admin-web/`.
|
||||
> **Estimated effort:** 2–3 weeks (1 engineer).
|
||||
> **Priority:** **P1** — gates every write-capable agent action to avoid blast radius incidents.
|
||||
|
||||
---
|
||||
|
||||
## 1. Current State (as of 2026-04-17)
|
||||
|
||||
| Surface | State | Evidence |
|
||||
| ---------------------------------- | ----------------------------------------------------------------------------------------------------- | ----------------------------------------------- |
|
||||
| `platform-service/modules/reviews` | Module exists: `notifications.ts`, `repository`, `routes`, `types` — partial implementation | 6 files in `reviews/` |
|
||||
| MFA push-approval | Narrow auth flow only | `auth` module |
|
||||
| Cowork destructive-action guard | `CoworkPermissionPrompter` emits `approval-required` Tauri event; resolves in-process | `crates/cowork-orchestrator/src/permissions.rs` |
|
||||
| A2A pipeline pause-for-review | Not implemented | — |
|
||||
| Product backend approval flows | Ad-hoc per product (JarvisJr, NomGap have custom) | grep confirms duplication |
|
||||
| Notifications | `notifications` module + delivery channels (Slack stub, email via Mailpit, push via `@bytelyst/push`) | `notifications/` |
|
||||
|
||||
**Net:** `reviews` module has skeleton but isn't the **generic human-in-the-loop system**. Every approval currently shipped is bespoke.
|
||||
|
||||
---
|
||||
|
||||
## 2. Success Criteria (V1)
|
||||
|
||||
1. Any agent run can emit `waiting_for_review` with a `reviewReason` and a decision contract; a reviewer's decision resumes or cancels the run.
|
||||
2. Cowork destructive actions (file delete, shell `rm`, commit push) route through the platform reviews module when `remote_approval_enabled=true` — otherwise keep local prompter.
|
||||
3. Reviewers see one inbox across products; can bulk-claim, add comments, and make decisions via admin UI or API.
|
||||
4. SLA timers: unclaimed reviews escalate at T1, auto-reject at T2 (configurable per risk level).
|
||||
5. Full audit trail: every decision captured with reviewer identity, timestamp, reason, supporting evidence pointers.
|
||||
6. One real product backend (JarvisJr or NomGap) removes its custom approval flow in favor of this.
|
||||
|
||||
---
|
||||
|
||||
## 3. Non-Goals (V1)
|
||||
|
||||
- Full BPM suite or arbitrary approval-chain designer
|
||||
- End-user case portals (reviewer-only UX in V1)
|
||||
- Risk-score auto-decisions (deferred to V2 with policy engine)
|
||||
- Legal-hold / eDiscovery extensions
|
||||
|
||||
---
|
||||
|
||||
## 4. Dependencies
|
||||
|
||||
- **Hard:** Durable Event Bus (C1) — for review events and notification fan-out
|
||||
- **Hard:** Agent Runtime Orchestration (C3) — for `waiting_for_review` state integration
|
||||
- **Soft:** Org & Workspace RBAC scaffolded roadmap — for reviewer assignment by role
|
||||
|
||||
---
|
||||
|
||||
## 5. Data Model
|
||||
|
||||
New Cosmos containers (PK in **bold**):
|
||||
|
||||
| Container | PK | Purpose |
|
||||
| -------------------- | ------------- | --------------------------------------------------------------------------------------- |
|
||||
| `reviews` | **productId** | `ReviewDoc`: subjectType, subjectId, riskLevel, requiredDecision, assignees, state, sla |
|
||||
| `review-decisions` | **reviewId** | `DecisionDoc`: decision, reviewerId, comment, decidedAt, attachments |
|
||||
| `review-escalations` | **reviewId** | `EscalationDoc`: level, triggeredAt, reassignedTo, reason |
|
||||
|
||||
Subject types: `agent_run_step`, `external_action`, `content_publication`, `prompt_change`, `low_confidence_output`, `tool_invocation`, `custom` (with `subjectPayload` JSON).
|
||||
|
||||
States: `pending → claimed → (approved | rejected | expired | superseded)`.
|
||||
|
||||
Risk levels: `low | medium | high | critical` → dictates SLA (24h / 8h / 2h / 30m default).
|
||||
|
||||
---
|
||||
|
||||
## 6. Phase Plan
|
||||
|
||||
### Phase 1 — Generic Review Object (week 1)
|
||||
|
||||
- [ ] **1.1** Extend `reviews/types.ts`: `ReviewDoc`, `DecisionDoc`, `EscalationDoc` zod schemas
|
||||
- [ ] **1.2** Repository: `createReview`, `getReview`, `listReviews`, `claimReview`, `submitDecision`, `expireReview`, `reassign`
|
||||
- [ ] **1.3** REST surface:
|
||||
- [ ] `POST /api/reviews` (service-to-service)
|
||||
- [ ] `GET /api/reviews/:id`
|
||||
- [ ] `GET /api/reviews` (filters: product, workspace, state, reviewer, risk, age)
|
||||
- [ ] `POST /api/reviews/:id/claim`
|
||||
- [ ] `POST /api/reviews/:id/decide { decision, comment }`
|
||||
- [ ] `POST /api/reviews/:id/reassign { reviewerIds }`
|
||||
- [ ] **1.4** Reviewer authz: `canReview(productId, workspaceId, userId, riskLevel)` helper (reuse `@bytelyst/org-client`)
|
||||
- [ ] **1.5** 20+ route tests covering ACL edges (cross-product, cross-workspace, expired claim)
|
||||
|
||||
**Exit:** create a review via curl → claim → decide → state machine enforced.
|
||||
|
||||
---
|
||||
|
||||
### Phase 2 — Workflow & Escalation (week 2)
|
||||
|
||||
- [ ] **2.1** SLA engine: background worker (on Durable Event Bus) scans pending reviews; escalates at T1, expires at T2
|
||||
- [ ] **2.2** Emit events: `review.created`, `review.claimed`, `review.decided`, `review.escalated`, `review.expired`
|
||||
- [ ] **2.3** Agent Runtime integration:
|
||||
- Agent run step can call `requestReview(subject, risk)` → step state `waiting_for_review` → run state `waiting_for_review`
|
||||
- `review.decided` event signals run; `approved` → continue, `rejected`/`expired` → fail step
|
||||
- [ ] **2.4** Escalation policies: per-product JSON config in `reviews/policies.ts` (default + product overrides)
|
||||
- [ ] **2.5** Supersession: if a new review for same subject arrives, the old one auto-`superseded` with audit link
|
||||
|
||||
**Exit:** integration test: agent run requests review → 5s later auto-escalates → reviewer decides → run resumes.
|
||||
|
||||
---
|
||||
|
||||
### Phase 3 — Reviewer Delivery + UI (week 2–3)
|
||||
|
||||
- [ ] **3.1** Notification fan-out on `review.created` + `review.escalated`:
|
||||
- Email via Mailpit / production SMTP
|
||||
- Slack channel webhook (reuse `notifications`)
|
||||
- Push via `@bytelyst/push` to assigned reviewer's devices
|
||||
- Telegram (optional adapter)
|
||||
- [ ] **3.2** Admin UI `dashboards/admin-web/app/ops/reviews/`:
|
||||
- Queue view with filters
|
||||
- Bulk-claim
|
||||
- Decision form with evidence pane (renders `subjectPayload`)
|
||||
- Timeline per review (events + escalations + decision)
|
||||
- [ ] **3.3** Reviewer mobile hook: `@bytelyst/react-native-platform-sdk` exposes `useReviewInbox()`
|
||||
- [ ] **3.4** Keyboard shortcuts for reviewers: `j/k` navigate, `a` approve, `r` reject, `c` comment, `e` escalate
|
||||
|
||||
**Exit:** reviewer completes 20 real reviews end-to-end with <30s median decision time on low-risk items.
|
||||
|
||||
---
|
||||
|
||||
### Phase 4 — Cowork + First Product Migration (week 3)
|
||||
|
||||
- [ ] **4.1** Cowork integration:
|
||||
- New flag `remote_approval_enabled` on cowork-service
|
||||
- When true, `CoworkPermissionPrompter` calls `platform-service POST /api/reviews` instead of emitting local Tauri event
|
||||
- Polls / subscribes for decision; resumes tool call on approval
|
||||
- [ ] **4.2** Rust-side adapter in `crates/cowork-orchestrator/src/permissions.rs` with HTTP fallback when platform unreachable → local prompt
|
||||
- [ ] **4.3** Migrate one product backend (recommend JarvisJr) off custom approval flow:
|
||||
- Identify all `needsApproval`/`pendingConfirm` sites
|
||||
- Replace with `platformClient.reviews.create()`
|
||||
- Delete product-local approval tables after 30-day sunset
|
||||
- [ ] **4.4** Document migration playbook for remaining products
|
||||
|
||||
**Exit:** JarvisJr custom approval code deleted; all approvals in admin queue.
|
||||
|
||||
---
|
||||
|
||||
## 7. Tech Stack Decision
|
||||
|
||||
| Option | V1? | Rationale |
|
||||
| ------------------------------------------------------------------------ | ------ | ---------------------------------------------- |
|
||||
| **Platform module + Durable Event Bus + existing notification channels** | ✅ Yes | Aligned with repo; zero new infra |
|
||||
| External ticketing (Jira / Linear) | No | Control-plane fit poor; audit trail fragmented |
|
||||
| Dedicated BPM (Camunda, Flowable) | No | Overkill; heavyweight |
|
||||
| Policy engine (OpenFGA / Cedar) | Later | Add once policy-as-code becomes a need |
|
||||
|
||||
---
|
||||
|
||||
## 8. Risks & Mitigations
|
||||
|
||||
| Risk | Mitigation |
|
||||
| -------------------------------------- | ----------------------------------------------------------------------------------------------------------- |
|
||||
| Review queue becomes dead-letter inbox | Mandatory SLA; auto-expire + escalate; dashboard surfaces queue age |
|
||||
| Inconsistent policy across products | Single `reviews/policies.ts` module; per-product overrides validated in CI |
|
||||
| Reviewer fatigue | Risk-based routing: low-risk items batch-decide; ship bulk-claim UI on day 1 |
|
||||
| Unaudited decisions | Every decision doc encrypted via `@bytelyst/field-encrypt` for `comment` field; reviewer identity immutable |
|
||||
| Cowork offline → blocks the agent | HTTP timeout 5s → fall back to local `CoworkPermissionPrompter`; decision reconciled on next reconnect |
|
||||
|
||||
---
|
||||
|
||||
## 9. Acceptance Checklist
|
||||
|
||||
- [ ] Phase 1 — generic review object + REST complete
|
||||
- [ ] Phase 2 — SLA engine + agent-run integration live
|
||||
- [ ] Phase 3 — admin UI + notifications across Slack/email/push
|
||||
- [ ] Phase 4 — cowork remote-approval flag + 1 product migration shipped
|
||||
- [ ] 30-day sunset plan documented for remaining product-local approval code
|
||||
- [ ] `docs/ECOSYSTEM_ARCHITECTURE.md` + AGENTS.md updated
|
||||
- [ ] Roadmap moved to `docs/roadmaps/completed/`
|
||||
@ -0,0 +1,164 @@
|
||||
# Knowledge & RAG Service — Detailed Roadmap (2026-04-17)
|
||||
|
||||
> **Supersedes:** `platform_KNOWLEDGE_RAG_SERVICE_ROADMAP.md` (scaffold).
|
||||
> **Status:** Planned — highest user-visible leverage once Event Bus lands.
|
||||
> **Primary surfaces:** `services/platform-service/src/modules/knowledge/`, `services/extraction-service/`, `packages/palace/`, new `packages/rag-client/`.
|
||||
> **Estimated effort:** 4–6 weeks (1 engineer) / 3 weeks (2 engineers).
|
||||
> **Priority:** **P1** — converges cowork institutional knowledge + MindLyst MemPalace + NoteLett palace + JarvisJr context.
|
||||
|
||||
---
|
||||
|
||||
## 1. Current State (as of 2026-04-17)
|
||||
|
||||
| Surface | State | Evidence |
|
||||
| ------------------------------------ | ------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
|
||||
| `platform-service/modules/knowledge` | Basic CRUD over knowledge docs — no chunking, no embeddings, no retrieval API | `services/platform-service/src/modules/knowledge/{routes,repository,types}.ts` |
|
||||
| `@bytelyst/palace` | MindLyst-style memory package with embeddings scaffolding (19 files) | `packages/palace/` |
|
||||
| `extraction-service` | PDF/xlsx/docx text extraction via Python sidecar — no embedding step | `services/extraction-service/` |
|
||||
| `packages/llm-router` | Deterministic router (provider/model selection); no embeddings client yet | `packages/llm-router/` |
|
||||
| Cowork `InstitutionalKnowledge` | Rust-side per-project `.md`/`.txt` loader (≤100KB scan) | `crates/cowork-orchestrator/src/admin.rs` |
|
||||
| Products using RAG | MindLyst (custom Palace), NoteLett (custom Palace), LysnrAI (none), JarvisJr (none), ChronoMind (none) | grep of product backends |
|
||||
|
||||
**Net:** every product wants RAG; nothing shared; extraction exists but stops at "text + metadata". No citation surface, no ACL.
|
||||
|
||||
---
|
||||
|
||||
## 2. Success Criteria (V1)
|
||||
|
||||
1. One `POST /api/knowledge/documents` call ingests a PDF → extraction → chunking → embedding → indexed and queryable within 60s P95.
|
||||
2. `POST /api/knowledge/search` returns chunks with citations (`documentId`, `pageNumber`, `quoteStart`, `quoteEnd`, `score`) scoped by `productId` + `workspaceId`.
|
||||
3. MindLyst + NoteLett both swap their custom palace indexers for `@bytelyst/rag-client` without UX regression.
|
||||
4. Cowork's `InstitutionalKnowledge` can optionally call `@bytelyst/rag-client` instead of local file scan when the flag `knowledge_rag_enabled` is on.
|
||||
5. Tenant isolation verified by integration test: product A query cannot retrieve product B docs even when bucket names overlap.
|
||||
|
||||
---
|
||||
|
||||
## 3. Non-Goals (V1)
|
||||
|
||||
- Multi-hop retrieval / graph-RAG
|
||||
- Hybrid reranking with a dedicated cross-encoder model
|
||||
- User-facing document editors / rich collaboration
|
||||
- Non-text modalities (image/audio embeddings) — deferred to V2
|
||||
|
||||
---
|
||||
|
||||
## 4. Dependencies
|
||||
|
||||
- **Hard:** Durable Event Bus (C1) for ingestion pipeline reliability
|
||||
- **Soft:** Org & Workspace RBAC (existing `@bytelyst/org-client` is sufficient for V1; deeper ACL comes with SCIM roadmap)
|
||||
|
||||
---
|
||||
|
||||
## 5. Data Model
|
||||
|
||||
New Cosmos containers (partition key in **bold**):
|
||||
|
||||
| Container | PK | Purpose |
|
||||
| ----------------------- | -------------- | ------------------------------------------------------------------------------ |
|
||||
| `knowledge-documents` | **productId** | Doc metadata (title, source URI, mime, sha256, workspaceId, createdBy, status) |
|
||||
| `knowledge-chunks` | **documentId** | Text + embedding ref + page/line spans + tokens |
|
||||
| `knowledge-embeddings` | **documentId** | Vector payload; separate from chunks to support vector-store migration |
|
||||
| `knowledge-ingest-jobs` | **productId** | Job lifecycle (queued/extracting/embedding/indexed/failed) |
|
||||
|
||||
Every doc carries `productId` (house rule) + `workspaceId` + `sourceHash` for dedup.
|
||||
|
||||
---
|
||||
|
||||
## 6. Phase Plan
|
||||
|
||||
### Phase 1 — Document Ingestion Pipeline (week 1–2)
|
||||
|
||||
- [ ] **1.1** Extend `knowledge/types.ts` with `KnowledgeDocumentDoc`, `KnowledgeChunkDoc`, `KnowledgeIngestJobDoc`
|
||||
- [ ] **1.2** `POST /api/knowledge/documents` creates document row → emits `knowledge.ingest.requested` event
|
||||
- [ ] **1.3** Worker (via Durable Event Bus): pulls event → fetches source (blob URL or inline bytes) → calls `extraction-service`
|
||||
- [ ] **1.4** Extraction worker returns normalized Markdown + per-chunk anchors (page, line range)
|
||||
- [ ] **1.5** Chunker (new `packages/rag-chunker`):
|
||||
- sliding-window 800 tokens, 120 overlap (configurable)
|
||||
- preserves heading hierarchy for "section" metadata
|
||||
- 25+ unit tests: markdown tables, code fences, multi-page PDFs, RTL text
|
||||
- [ ] **1.6** Dedup: if `sourceHash` exists for `productId`+`workspaceId`, skip re-embedding; update references only
|
||||
|
||||
**Exit:** Upload `AGENTS.md` → within 60s, 10–30 chunks stored with anchors.
|
||||
|
||||
---
|
||||
|
||||
### Phase 2 — Embedding + Retrieval (week 2–3)
|
||||
|
||||
- [ ] **2.1** New `@bytelyst/llm-router` subpath: `llmRouter.embedBatch({ texts, model })` — default `text-embedding-3-small`
|
||||
- [ ] **2.2** Embedding worker consumes `knowledge.chunks.ready` events
|
||||
- Batching: up to 256 chunks / request
|
||||
- Retry + DLQ via Durable Event Bus
|
||||
- Cost accounting wired to `ai-budgets` module
|
||||
- [ ] **2.3** Vector store — **primary** pgvector via new `packages/vector-store/` adapter
|
||||
- `docker-compose.yml` adds Postgres 16 + pgvector extension
|
||||
- AKV binding: `AZURE_POSTGRES_CONNECTION_STRING`
|
||||
- Fallback: Cosmos native vector search (when Azure emulator supports it) as secondary adapter
|
||||
- [ ] **2.4** `POST /api/knowledge/search { query, topK, filters: { workspaceId, tags, dateRange } }`:
|
||||
- Embeds query (cache by sha256 of normalized query, 5-min TTL)
|
||||
- pgvector k-NN with metadata filter pushdown
|
||||
- Returns `{ chunkId, documentId, score, text, citation: { page, lineStart, lineEnd } }`
|
||||
- [ ] **2.5** `POST /api/knowledge/assemble { query, maxTokens, productId }` — composes a context block under `maxTokens` with inline citations (`[^1]` syntax)
|
||||
|
||||
**Exit:** integration test: upload a 50-page PDF → query → answer includes citations pointing to real page numbers.
|
||||
|
||||
---
|
||||
|
||||
### Phase 3 — Client SDK + Access Control (week 4)
|
||||
|
||||
- [ ] **3.1** New package `@bytelyst/rag-client` (mirrors existing client package patterns — see `subscription-client`, `billing-client`)
|
||||
- `createRagClient({ baseUrl, token, productId })`
|
||||
- Methods: `uploadDocument`, `search`, `assemble`, `getDocument`, `deleteDocument`, `listDocuments`
|
||||
- TypeScript types generated from platform zod schemas
|
||||
- [ ] **3.2** RBAC: every request checks `workspaceId` membership via `@bytelyst/org-client`
|
||||
- [ ] **3.3** Tenant isolation test suite (`knowledge-acl.test.ts`) — 10+ cases including cross-product attack scenarios
|
||||
- [ ] **3.4** Field-encrypt sensitive docs at rest (via `@bytelyst/field-encrypt`) — encrypt `text` column; chunk vectors remain plain (they're derived and low-info)
|
||||
|
||||
**Exit:** NoteLett integrates `@bytelyst/rag-client` in a branch; local E2E still passes.
|
||||
|
||||
---
|
||||
|
||||
### Phase 4 — Connectors + Operator UX (week 5–6)
|
||||
|
||||
- [ ] **4.1** Connector: file upload via blob (reuse `@bytelyst/blob`)
|
||||
- [ ] **4.2** Connector: URL crawl (single-page) using existing Python sidecar
|
||||
- [ ] **4.3** Connector: Gitea repo ingest (`.md` + `AGENTS.md` across a list of repos) — feeds cowork's knowledge loop
|
||||
- [ ] **4.4** Admin UI in `dashboards/admin-web/app/ops/knowledge/` — list, reindex, delete, see ingestion DLQ
|
||||
- [ ] **4.5** Metrics: `knowledge_ingest_latency_seconds`, `knowledge_search_latency_ms`, `knowledge_embedding_tokens_total`, surfaced in Grafana
|
||||
|
||||
**Exit:** 20 real `AGENTS.md` files across the workspace ingested; cowork query returns citations.
|
||||
|
||||
---
|
||||
|
||||
## 7. Tech Stack Decision
|
||||
|
||||
| Vector Backend | V1? | Rationale |
|
||||
| --------------------------- | ----------------- | ----------------------------------------------------------- |
|
||||
| **pgvector on Postgres 16** | ✅ Yes | Unified metadata+vector, SQL filtering, ACID for ACL, cheap |
|
||||
| Qdrant | Deferred | Better pure-vector perf; adds an extra system |
|
||||
| Cosmos vector search | Secondary adapter | Useful in Azure-only envs; still evolving |
|
||||
| Azure AI Search | No | Vendor lock-in; less control |
|
||||
|
||||
**Default:** pgvector primary, Cosmos secondary. Adapter interface allows swap without touching callers.
|
||||
|
||||
---
|
||||
|
||||
## 8. Risks & Mitigations
|
||||
|
||||
| Risk | Mitigation |
|
||||
| ----------------------------------- | --------------------------------------------------------------------------------- |
|
||||
| Retrieval leaks cross-tenant data | ACL filter enforced in adapter layer, not app layer; adversarial test suite |
|
||||
| Embedding cost balloons | `ai-budgets` enforcement at dispatch + dedup by sourceHash + query cache |
|
||||
| Stale embeddings after model change | `embeddingModel` recorded per chunk; reindex job triggered by model registry bump |
|
||||
| Product teams bypass the service | Deprecation path: mark palace methods `@deprecated`, CI warning, 90-day sunset |
|
||||
|
||||
---
|
||||
|
||||
## 9. Acceptance Checklist
|
||||
|
||||
- [ ] Phase 1 — ingestion pipeline live; DLQ visible in admin
|
||||
- [ ] Phase 2 — pgvector in compose; search returns citations
|
||||
- [ ] Phase 3 — `@bytelyst/rag-client` published; NoteLett migration branch green
|
||||
- [ ] Phase 4 — admin UI + Gitea connector + Grafana dashboard
|
||||
- [ ] Cowork `InstitutionalKnowledge` opt-in migration path documented
|
||||
- [ ] `docs/ECOSYSTEM_ARCHITECTURE.md` updated with the RAG substrate
|
||||
- [ ] Roadmap moved to `docs/roadmaps/completed/`
|
||||
Loading…
Reference in New Issue
Block a user