docs(roadmaps): detailed 2026-04-17 plans for event bus, RAG, agent runtime, approval queue

Four grounded roadmaps superseding the scaffolded versions with current-state audits, data models, week-by-week phase plans, tech-stack decisions, and acceptance checklists. Execution order: Event Bus (P0) unlocks Agent Runtime + RAG (P1 parallel), which unlock Approval Queue (P1).
2026-04-17 09:36:44 -07:00 · 2026-04-17 09:36:44 -07:00 · 46a16f06bc
commit 46a16f06bc
parent 7d266bfcc0
4 changed files with 663 additions and 0 deletions
--- a/docs/roadmaps/scaffolded/platform_AGENT_RUNTIME_ORCHESTRATION_ROADMAP_apr17.md
+++ b/docs/roadmaps/scaffolded/platform_AGENT_RUNTIME_ORCHESTRATION_ROADMAP_apr17.md
@ -0,0 +1,183 @@
+# Agent Runtime & Orchestration — Detailed Roadmap (2026-04-17)
+
+> **Supersedes:** `platform_AGENT_RUNTIME_ORCHESTRATION_ROADMAP.md` (scaffold).
+> **Status:** Planned — enables platform-hosted multi-step agent runs reusable across products.
+> **Primary surfaces:** `services/platform-service/src/modules/{agent-runtime,agents,runs}/`, `services/mcp-server/src/modules/a2a/`, `services/cowork-service/`.
+> **Estimated effort:** 3–5 weeks (1 engineer) / 2–3 weeks (2 engineers).
+> **Priority:** **P1** — closes the gap between cowork (desktop agent) and platform-hosted agents invoked by product backends or admins.
+
+---
+
+## 1. Current State (as of 2026-04-17)
+
+| Surface                                  | State                                                                                              | Evidence                                             |
+| ---------------------------------------- | -------------------------------------------------------------------------------------------------- | ---------------------------------------------------- |
+| `platform-service/modules/agent-runtime` | Routes only; no persistent state model                                                             | `routes.ts`, `routes.test.ts`                        |
+| `platform-service/modules/agents`        | Fuller: `repository`, `routes`, `executor`, `tool-registry`, `types` — but in-memory execution     | 8 files in `agents/`                                 |
+| `platform-service/modules/runs`          | Run tracker (`tracker.ts`), repository, routes — already models run lifecycle but not queue-backed | 7 files                                              |
+| `platform-service/modules/ai-budgets`    | Exists — enforces spend per product                                                                | `ai-budgets/`                                        |
+| `platform-service/modules/agent-evals`   | Exists — eval harness skeleton                                                                     | `agent-evals/`                                       |
+| `mcp-server/modules/a2a`                 | A2A pipelines as composed code paths; no durable step state                                        | `services/mcp-server/src/modules/a2a/`               |
+| `cowork-service/modules/agent-runtime`   | IPC bridge to cowork-orchestrator `submit_task`                                                    | `services/cowork-service/src/modules/agent-runtime/` |
+| `@bytelyst/queue`                        | `QueueWorker` + memory/file stores. No cross-process yet                                           | `packages/queue/`                                    |
+
+**Net:** the run data model is partially there; what's missing is the **durable queue-backed execution** plus **unified orchestration** so cowork, product-backend agents, and admin-initiated runs share one run API.
+
+---
+
+## 2. Success Criteria (V1)
+
+1. An `POST /api/agent-runs` request creates a queued run; executor worker picks it up and progresses through steps with full persistence.
+2. Kill the executor mid-step → run resumes from the last completed step on restart without duplicated side effects (idempotency on step outputs).
+3. A run can enter `waiting_for_review` (see Approval Queue roadmap) and resume when a reviewer decides.
+4. Cowork sessions created via `ipc-bridge` register as platform agent runs and appear in `/ops/runs`.
+5. Each run carries `aiBudgetId` and writes per-step token usage; over-budget runs pause, not crash.
+6. 300+ concurrent runs sustained with per-run latency < 2× single-run baseline.
+
+---
+
+## 3. Non-Goals (V1)
+
+- Full Temporal adoption (contract is Temporal-compatible for future migration; no Temporal infra yet)
+- User-authored workflow DSL
+- Multi-region run migration
+- Streaming step outputs to clients (add post-V1 via SSE + existing `@bytelyst/fastify-sse`)
+
+---
+
+## 4. Dependencies
+
+- **Hard:** Durable Event Bus & Worker Runtime (C1)
+- **Hard:** `@bytelyst/queue` Redis adapter (Phase 2 of C1)
+- **Soft:** Human Review Approval Queue (C4) for `waiting_for_review` semantics
+- **Soft:** Agent Registry & Prompt Versioning (existing scaffolded doc) — for deterministic `agentVersion` binding
+
+---
+
+## 5. Canonical Run Model
+
+New/expanded Cosmos containers (PK in **bold**):
+
+| Container          | PK            | Purpose                                                                                                    |
+| ------------------ | ------------- | ---------------------------------------------------------------------------------------------------------- |
+| `agent-runs`       | **productId** | `AgentRunDoc`: state, workflowId, agentId, agentVersion, triggerSource, parentRunId, aiBudgetId, createdBy |
+| `agent-run-steps`  | **runId**     | Step inputs/outputs/timings/errors; idempotency key; `state`                                               |
+| `agent-run-events` | **runId**     | Durable event log (mirror into event store) for replay/operator view                                       |
+| `agent-run-leases` | **runId**     | Worker lease with expiry — enables safe failover                                                           |
+
+Run states: `queued → running → waiting_for_input | waiting_for_review | paused → running → (succeeded | failed | cancelled)`.
+
+Step states: `pending → running → (succeeded | failed | retrying | skipped)`.
+
+Every doc includes `productId` + `workspaceId` + `correlationId`.
+
+---
+
+## 6. Phase Plan
+
+### Phase 1 — Durable Run Model (week 1)
+
+- [ ] **1.1** Extend `agent-runtime/types.ts` with `AgentRunDoc`, `AgentRunStepDoc`, `AgentRunEventDoc`, `AgentRunLeaseDoc` (zod schemas)
+- [ ] **1.2** Repository methods: `createRun`, `getRun`, `listRuns`, `appendStep`, `updateStepState`, `acquireLease`, `renewLease`, `releaseLease`
+- [ ] **1.3** Migrate existing `runs/` module → rename to legacy alias; new agent-runs module becomes source of truth
+- [ ] **1.4** REST surface:
+  - [ ] `POST /api/agent-runs` → creates run, publishes `agent.run.execute` event
+  - [ ] `GET /api/agent-runs/:id`
+  - [ ] `GET /api/agent-runs/:id/events` (paginated + optional SSE)
+  - [ ] `POST /api/agent-runs/:id/cancel` → sets `cancelRequested` flag read by executor
+  - [ ] `POST /api/agent-runs/:id/pause` / `/resume`
+  - [ ] `POST /api/agent-runs/:id/signal` — named signal (for approval, input, tool-result)
+- [ ] **1.5** 25+ route tests using `@bytelyst/testing` fastify inject helpers
+
+**Exit:** run lifecycle transitions persist across fastify restart; all state queryable.
+
+---
+
+### Phase 2 — Queue-Backed Executor (week 2)
+
+- [ ] **2.1** Add `agent.run.execute` queue type on top of Durable Event Bus + `@bytelyst/queue`
+- [ ] **2.2** `AgentRunExecutor` class in `modules/agent-runtime/executor.ts`:
+  - Acquires lease (30s heartbeat); holds until step completes
+  - For each step: resolves agent spec → calls tool (`agents/tool-registry.ts`) → records output with idempotency key
+  - On failure: exponential backoff, max retries per step from `agentVersion.retryPolicy`
+  - DLQ on exhausted retries; run → `failed`
+- [ ] **2.3** Idempotency:
+  - Caller can pass `idempotencyKey` on `POST /agent-runs`; repeat within 24h returns existing run
+  - Each step output stored by `{runId, stepIndex, attempt}`; external tool calls opt-in to idempotency via handler contract
+- [ ] **2.4** Budget integration: pre-step check against `ai-budgets`; over-budget → run → `waiting_for_input` with reason `budget_exceeded`
+- [ ] **2.5** Cancellation: `cancelRequested` polled between steps; in-flight step finishes cleanly, run → `cancelled`
+
+**Exit:** chaos test — kill executor process during a 5-step run. Run resumes from last checkpoint. Idempotent tool not double-invoked.
+
+---
+
+### Phase 3 — A2A + Cowork Integration (week 3)
+
+- [ ] **3.1** `mcp-server/modules/a2a` refactor: pipeline steps become agent-run steps. Pipeline definition → stored as workflow template, not code path.
+- [ ] **3.2** `cowork-service/modules/agent-runtime` updates: when cowork submits a task via IPC, also register as platform agent run (dual-write with `triggerSource=cowork`)
+- [ ] **3.3** Cross-system correlation: cowork's `taskId` stored as external ref on agent-run; admin `/ops/runs` surfaces both sides
+- [ ] **3.4** Agent Registry integration (depends on separately scaffolded roadmap): `agentId + agentVersion` required on run creation; registry prompt snapshot frozen at run start
+
+**Exit:** every cowork session visible under `/ops/runs`; every A2A pipeline surfaces step-by-step.
+
+---
+
+### Phase 4 — Operator Experience (week 4)
+
+- [ ] **4.1** Admin UI `dashboards/admin-web/app/ops/runs/`:
+  - List with filters (product, workspace, agent, state, age, reviewer)
+  - Timeline view per run
+  - Step diff view (prompt/tool-call/output)
+  - Cancel / retry / replay controls
+- [ ] **4.2** SLO dashboard: success rate, mean duration, retry rate, DLQ count per agent
+- [ ] **4.3** `@bytelyst/platform-client` additions: `client.agentRuns.create/get/cancel/signal` for product backends
+- [ ] **4.4** Migration guide: product backends that roll their own agent loops → 1-call swap
+
+**Exit:** one product backend (e.g., JarvisJr) dispatches an agent via platform runs instead of in-proc call.
+
+---
+
+### Phase 5 — Scale & Hardening (week 5)
+
+- [ ] **5.1** Horizontal scale: 2+ executor replicas via lease-based safety
+- [ ] **5.2** Load test: 300 concurrent runs, 5 steps each, measure P50/P95/P99
+- [ ] **5.3** Backpressure: when queue depth > N, new runs enter `queued` but `POST` returns immediately; admin can see queue saturation
+- [ ] **5.4** Circuit breakers: per-agent failure-rate trips new run acceptance for 60s
+- [ ] **5.5** Observability: OpenTelemetry spans around every step; trace ID propagates to external tool calls
+
+**Exit:** load test meets latency budget; operator dashboard usable under load.
+
+---
+
+## 7. Tech Stack Decision
+
+| Option                                                  | V1?                           | Rationale                                         |
+| ------------------------------------------------------- | ----------------------------- | ------------------------------------------------- |
+| **`@bytelyst/queue` + Redis adapter + Cosmos run docs** | ✅ Yes                        | Reuses Durable Event Bus; no new major infra      |
+| Temporal                                                | Deferred                      | Right answer for V2; too much infra for V1        |
+| BullMQ direct                                           | Replaced by `@bytelyst/queue` | Same class; we'd just be skipping our abstraction |
+| SQS / EventBridge                                       | Skip                          | Cloud-vendor coupling                             |
+
+---
+
+## 8. Risks & Mitigations
+
+| Risk                                    | Mitigation                                                                                                                                      |
+| --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
+| "Weak in-house Temporal" drift          | Contract borrows Temporal terminology (workflowId, runId, signals, child runs) so future migration is mechanical                                |
+| Step idempotency misuse                 | Executor requires handlers declare `idempotent: true`; non-idempotent defaults to single-attempt + failover-safe retry only on transient errors |
+| Lease contention at scale               | Leases include random jitter; stuck runs auto-released after 3× heartbeat period                                                                |
+| Product teams bypass to run local loops | Deprecation + budget-module enforcement: unregistered runs get 0 budget                                                                         |
+
+---
+
+## 9. Acceptance Checklist
+
+- [ ] Phase 1 — run model persistent; REST complete
+- [ ] Phase 2 — executor survives restart; idempotency proven
+- [ ] Phase 3 — A2A + cowork both integrated
+- [ ] Phase 4 — admin UI shipped; one product backend migrated
+- [ ] Phase 5 — 300-concurrent load test passes
+- [ ] `platform-client` published with new methods
+- [ ] `docs/ECOSYSTEM_ARCHITECTURE.md` + AGENTS.md updated
+- [ ] Roadmap moved to `docs/roadmaps/completed/`
--- a/docs/roadmaps/scaffolded/platform_DURABLE_EVENT_BUS_AND_WORKER_RUNTIME_ROADMAP_apr17.md
+++ b/docs/roadmaps/scaffolded/platform_DURABLE_EVENT_BUS_AND_WORKER_RUNTIME_ROADMAP_apr17.md
@ -0,0 +1,143 @@
+# Durable Event Bus & Worker Runtime — Detailed Roadmap (2026-04-17)
+
+> **Supersedes:** `platform_DURABLE_EVENT_BUS_AND_WORKER_RUNTIME_ROADMAP.md` (scaffold).
+> **Status:** Planned — unblocks Agent Runtime, Approval Queue, Knowledge ingestion.
+> **Primary surfaces:** `packages/events/`, `packages/queue/`, `packages/event-store/`, `services/platform-service/`.
+> **Estimated effort:** 3–4 weeks (1 engineer) / 2 weeks (2 engineers).
+> **Priority:** **P0** — foundation for Categories C3/C4 and for reliable cross-service side effects.
+
+---
+
+## 1. Current State (as of 2026-04-17)
+
+| Surface                 | State                                                                                                                                | Evidence                                   |
+| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------ |
+| `@bytelyst/events`      | `EventBus` (memory) + `DurableEventBus` interface + typed schemas (AgentRun, AgentTask, AgentDispatch, AgentCheckpoint, Artifact, …) | `packages/events/src/index.ts`             |
+| `@bytelyst/event-store` | Cosmos-backed append-only event store (exists, underused)                                                                            | `packages/event-store/src/`                |
+| `@bytelyst/queue`       | `QueueWorker` + `MemoryQueueStore` + `FileQueueStore`. No Redis, no cross-process delivery                                           | `packages/queue/src/index.ts`              |
+| Platform subscribers    | `delivery`, `notifications`, `audit`, `runs/tracker.ts` emit on the in-proc memory bus; cross-service = ad-hoc HTTP                  | `services/platform-service/src/modules/*/` |
+| Worker processes        | None standardized. Scheduled jobs embed inside fastify via setInterval / per-module timers                                           | n/a                                        |
+
+**Net:** we have durable-queue and durable-event-store primitives but no **standardized durable bus or worker runtime** consuming them. Production flows still ride the in-memory bus.
+
+---
+
+## 2. Success Criteria (V1)
+
+1. One event emitted in `platform-service` is reliably consumed by `cowork-service`, `extraction-service`, or `mcp-server` within 1s P95, surviving restart on either side.
+2. Every durable subscriber records ack / retry / DLQ counts; operators can query/replay from Cosmos.
+3. A single `workers/` bootstrap pattern is documented and used by ≥3 side-effect consumers.
+4. No new module added after Phase 3 relies on the in-memory bus for production-critical flows.
+5. `@bytelyst/events` memory adapter remains the default in **test environments only** (`NODE_ENV=test`).
+
+---
+
+## 3. Non-Goals (V1)
+
+- Kafka-scale throughput, exactly-once delivery, cross-region replication
+- Full Temporal semantics (that's the Agent Runtime roadmap)
+- Replacing the HTTP REST layer between services
+
+---
+
+## 4. Dependencies
+
+- None (this is the foundation). Enables: Agent Runtime Orchestration, Human Review Approval Queue, Knowledge RAG ingestion.
+
+---
+
+## 5. Phase Plan
+
+### Phase 1 — Pluggable Event Bus Backends (week 1)
+
+- [ ] **1.1** Formalize `DurableEventAdapter` contract in `@bytelyst/events` (publish, subscribe, ack, nack, dlq, replay)
+- [ ] **1.2** Keep `MemoryAdapter` as the default for tests
+- [ ] **1.3** Implement `CosmosEventStoreAdapter` on top of existing `@bytelyst/event-store` (append + tail via change-feed)
+  - [ ] Partition key: `productId` + day-bucket
+  - [ ] Consumer groups via lease-collection (reuse `packages/cosmos/src/lease-manager` if present; else add lightweight doc-based lease)
+- [ ] **1.4** Add `correlationId` + `causationId` to every `PlatformEventSchemas` entry (schema-breaking; ship behind zod `.catchall()` fallback for unknown fields)
+- [ ] **1.5** Add `packages/events/src/durable.test.ts` — adapter conformance suite (publish/subscribe/ack/retry/DLQ)
+
+**Exit:** `pnpm --filter @bytelyst/events test` covers the Cosmos adapter using the Cosmos emulator.
+
+---
+
+### Phase 2 — Standardized Worker Runtime (week 2)
+
+- [ ] **2.1** New surface: `services/platform-service/src/workers/` with a single `createWorker(opts)` bootstrap
+  - [ ] Concurrency control (default: `AGENT_WORKER_CONCURRENCY=4`)
+  - [ ] Lease heartbeat (default 30s, configurable per handler)
+  - [ ] Graceful shutdown on SIGINT/SIGTERM
+  - [ ] Pino child logger with `workerName`, `eventId`, `correlationId` bindings
+- [ ] **2.2** Expose `/ops/workers/*` admin routes: list workers, inspect in-flight leases, force-replay DLQ item, pause/resume
+- [ ] **2.3** Health: each worker reports into `GET /api/health/dependencies` under a `workers` key
+- [ ] **2.4** Operator command: `pnpm --filter @lysnrai/platform-service worker:run <name>` for running a single worker out-of-process (k8s-ready)
+- [ ] **2.5** Dead-letter inspection: Cosmos container `events-dlq` with 30-day TTL; admin UI can re-enqueue
+
+**Exit:** `reviews` module's notification side-effects migrated to durable worker. Kill the Fastify process mid-emit; event still delivered after restart.
+
+---
+
+### Phase 3 — Migrate Production Side Effects (week 3)
+
+- [ ] **3.1** Audit in-memory `EventBus.emit()` call sites (expect ≤20) and categorize:
+  - Critical (must durably deliver): auth-events, run-state-changes, billing-events, approval-events
+  - Best-effort (OK to lose): dev-time dashboards, local UI refresh hints
+- [ ] **3.2** Migrate critical callers to `DurableEventBus` one module at a time:
+  - [ ] `auth` side-effects (login audit, MFA trigger)
+  - [ ] `runs/tracker.ts` state transitions
+  - [ ] `billing-checkout` / `dunning` / `subscriptions` webhooks
+  - [ ] `notifications` fan-out
+  - [ ] `delivery` outbox
+- [ ] **3.3** Replace per-module `setInterval` schedulers with registered durable workers (e.g., `retention/`, `exports/`, `predictive-analytics/`)
+- [ ] **3.4** Telemetry: emit `event_published_total`, `event_consumed_total`, `event_dlq_total`, `worker_lease_expired_total` via `@bytelyst/backend-telemetry`
+- [ ] **3.5** Grafana dashboard: event-lag P95 per subscriber, DLQ size, worker concurrency saturation
+
+**Exit:** post-migration, no production code path emits to `new EventBus()` in `NODE_ENV=production`. Enforced by an ESLint rule in `eslint.config.js` (`no-restricted-imports` scoped to `@bytelyst/events/memory`).
+
+---
+
+### Phase 4 — Cross-Service Subscriptions (week 3–4)
+
+- [ ] **4.1** `cowork-service` subscribes to `agent.run.*` events and mirrors state into its own tracker (currently polls `platform-service/runs`)
+- [ ] **4.2** `extraction-service` subscribes to `knowledge.ingest.requested` (see RAG roadmap) instead of exposing REST
+- [ ] **4.3** `mcp-server` emits `a2a.pipeline.step.*` events that become the audit trail source of truth
+- [ ] **4.4** Compose: add a Redis container to `docker-compose.yml` as an **optional** fast-path adapter (later phase)
+
+**Exit:** `pnpm prototype:self-test` exercises one end-to-end cross-service event.
+
+---
+
+## 6. Tech Stack Decision
+
+| Option                         | V1 default?              | Rationale                                                                |
+| ------------------------------ | ------------------------ | ------------------------------------------------------------------------ |
+| **Cosmos change-feed adapter** | ✅ Yes                   | Reuses existing emulator, persistence story, Azure runtime; no new infra |
+| Redis Streams                  | Later (Phase 4 optional) | Lower latency; add after Cosmos proves semantics                         |
+| NATS JetStream                 | Deferred                 | Best long-term; reassess at Phase 5+                                     |
+| Kafka                          | Never (overkill)         | —                                                                        |
+
+**Default decision:** Cosmos-backed adapter first → add Redis optional adapter at Phase 4 → revisit NATS once event volume > 10 events/sec sustained.
+
+---
+
+## 7. Risks & Mitigations
+
+| Risk                               | Mitigation                                                                                                 |
+| ---------------------------------- | ---------------------------------------------------------------------------------------------------------- |
+| Cosmos change-feed lag under burst | Per-consumer lease + configurable `maxItemCount`; alert on lag > 5s                                        |
+| Schema drift breaks replay         | Every event carries `schemaVersion`; adapters refuse unknown major versions; replay path calls migrator    |
+| Workers multiply process count     | `services/platform-service` runs workers in-process by default; only split out when concurrency demands it |
+| In-memory bus regressions          | Lint rule + CI grep blocks `new EventBus()` outside `*/test/*` paths                                       |
+
+---
+
+## 8. Acceptance Checklist
+
+- [ ] Phase 1 complete — adapter conformance suite passes on memory + Cosmos
+- [ ] Phase 2 complete — `/ops/workers` + DLQ admin routes shipped
+- [ ] Phase 3 complete — ≥5 modules migrated; ESLint guard active
+- [ ] Phase 4 complete — one cross-service flow demonstrated end-to-end
+- [ ] Grafana dashboard `platform-events` live
+- [ ] AGENTS.md + `docs/ECOSYSTEM_ARCHITECTURE.md` updated
+- [ ] Roadmap moved to `docs/roadmaps/completed/`
--- a/docs/roadmaps/scaffolded/platform_HUMAN_REVIEW_APPROVAL_QUEUE_ROADMAP_apr17.md
+++ b/docs/roadmaps/scaffolded/platform_HUMAN_REVIEW_APPROVAL_QUEUE_ROADMAP_apr17.md
@ -0,0 +1,173 @@
+# Human Review & Approval Queue — Detailed Roadmap (2026-04-17)
+
+> **Supersedes:** `platform_HUMAN_REVIEW_APPROVAL_QUEUE_ROADMAP.md` (scaffold).
+> **Status:** Planned — critical safety rail for autonomous agents.
+> **Primary surfaces:** `services/platform-service/src/modules/reviews/`, new `modules/approvals/`, new `modules/escalations/`, `dashboards/admin-web/`.
+> **Estimated effort:** 2–3 weeks (1 engineer).
+> **Priority:** **P1** — gates every write-capable agent action to avoid blast radius incidents.
+
+---
+
+## 1. Current State (as of 2026-04-17)
+
+| Surface                            | State                                                                                                 | Evidence                                        |
+| ---------------------------------- | ----------------------------------------------------------------------------------------------------- | ----------------------------------------------- |
+| `platform-service/modules/reviews` | Module exists: `notifications.ts`, `repository`, `routes`, `types` — partial implementation           | 6 files in `reviews/`                           |
+| MFA push-approval                  | Narrow auth flow only                                                                                 | `auth` module                                   |
+| Cowork destructive-action guard    | `CoworkPermissionPrompter` emits `approval-required` Tauri event; resolves in-process                 | `crates/cowork-orchestrator/src/permissions.rs` |
+| A2A pipeline pause-for-review      | Not implemented                                                                                       | —                                               |
+| Product backend approval flows     | Ad-hoc per product (JarvisJr, NomGap have custom)                                                     | grep confirms duplication                       |
+| Notifications                      | `notifications` module + delivery channels (Slack stub, email via Mailpit, push via `@bytelyst/push`) | `notifications/`                                |
+
+**Net:** `reviews` module has skeleton but isn't the **generic human-in-the-loop system**. Every approval currently shipped is bespoke.
+
+---
+
+## 2. Success Criteria (V1)
+
+1. Any agent run can emit `waiting_for_review` with a `reviewReason` and a decision contract; a reviewer's decision resumes or cancels the run.
+2. Cowork destructive actions (file delete, shell `rm`, commit push) route through the platform reviews module when `remote_approval_enabled=true` — otherwise keep local prompter.
+3. Reviewers see one inbox across products; can bulk-claim, add comments, and make decisions via admin UI or API.
+4. SLA timers: unclaimed reviews escalate at T1, auto-reject at T2 (configurable per risk level).
+5. Full audit trail: every decision captured with reviewer identity, timestamp, reason, supporting evidence pointers.
+6. One real product backend (JarvisJr or NomGap) removes its custom approval flow in favor of this.
+
+---
+
+## 3. Non-Goals (V1)
+
+- Full BPM suite or arbitrary approval-chain designer
+- End-user case portals (reviewer-only UX in V1)
+- Risk-score auto-decisions (deferred to V2 with policy engine)
+- Legal-hold / eDiscovery extensions
+
+---
+
+## 4. Dependencies
+
+- **Hard:** Durable Event Bus (C1) — for review events and notification fan-out
+- **Hard:** Agent Runtime Orchestration (C3) — for `waiting_for_review` state integration
+- **Soft:** Org & Workspace RBAC scaffolded roadmap — for reviewer assignment by role
+
+---
+
+## 5. Data Model
+
+New Cosmos containers (PK in **bold**):
+
+| Container            | PK            | Purpose                                                                                 |
+| -------------------- | ------------- | --------------------------------------------------------------------------------------- |
+| `reviews`            | **productId** | `ReviewDoc`: subjectType, subjectId, riskLevel, requiredDecision, assignees, state, sla |
+| `review-decisions`   | **reviewId**  | `DecisionDoc`: decision, reviewerId, comment, decidedAt, attachments                    |
+| `review-escalations` | **reviewId**  | `EscalationDoc`: level, triggeredAt, reassignedTo, reason                               |
+
+Subject types: `agent_run_step`, `external_action`, `content_publication`, `prompt_change`, `low_confidence_output`, `tool_invocation`, `custom` (with `subjectPayload` JSON).
+
+States: `pending → claimed → (approved | rejected | expired | superseded)`.
+
+Risk levels: `low | medium | high | critical` → dictates SLA (24h / 8h / 2h / 30m default).
+
+---
+
+## 6. Phase Plan
+
+### Phase 1 — Generic Review Object (week 1)
+
+- [ ] **1.1** Extend `reviews/types.ts`: `ReviewDoc`, `DecisionDoc`, `EscalationDoc` zod schemas
+- [ ] **1.2** Repository: `createReview`, `getReview`, `listReviews`, `claimReview`, `submitDecision`, `expireReview`, `reassign`
+- [ ] **1.3** REST surface:
+  - [ ] `POST /api/reviews` (service-to-service)
+  - [ ] `GET /api/reviews/:id`
+  - [ ] `GET /api/reviews` (filters: product, workspace, state, reviewer, risk, age)
+  - [ ] `POST /api/reviews/:id/claim`
+  - [ ] `POST /api/reviews/:id/decide { decision, comment }`
+  - [ ] `POST /api/reviews/:id/reassign { reviewerIds }`
+- [ ] **1.4** Reviewer authz: `canReview(productId, workspaceId, userId, riskLevel)` helper (reuse `@bytelyst/org-client`)
+- [ ] **1.5** 20+ route tests covering ACL edges (cross-product, cross-workspace, expired claim)
+
+**Exit:** create a review via curl → claim → decide → state machine enforced.
+
+---
+
+### Phase 2 — Workflow & Escalation (week 2)
+
+- [ ] **2.1** SLA engine: background worker (on Durable Event Bus) scans pending reviews; escalates at T1, expires at T2
+- [ ] **2.2** Emit events: `review.created`, `review.claimed`, `review.decided`, `review.escalated`, `review.expired`
+- [ ] **2.3** Agent Runtime integration:
+  - Agent run step can call `requestReview(subject, risk)` → step state `waiting_for_review` → run state `waiting_for_review`
+  - `review.decided` event signals run; `approved` → continue, `rejected`/`expired` → fail step
+- [ ] **2.4** Escalation policies: per-product JSON config in `reviews/policies.ts` (default + product overrides)
+- [ ] **2.5** Supersession: if a new review for same subject arrives, the old one auto-`superseded` with audit link
+
+**Exit:** integration test: agent run requests review → 5s later auto-escalates → reviewer decides → run resumes.
+
+---
+
+### Phase 3 — Reviewer Delivery + UI (week 2–3)
+
+- [ ] **3.1** Notification fan-out on `review.created` + `review.escalated`:
+  - Email via Mailpit / production SMTP
+  - Slack channel webhook (reuse `notifications`)
+  - Push via `@bytelyst/push` to assigned reviewer's devices
+  - Telegram (optional adapter)
+- [ ] **3.2** Admin UI `dashboards/admin-web/app/ops/reviews/`:
+  - Queue view with filters
+  - Bulk-claim
+  - Decision form with evidence pane (renders `subjectPayload`)
+  - Timeline per review (events + escalations + decision)
+- [ ] **3.3** Reviewer mobile hook: `@bytelyst/react-native-platform-sdk` exposes `useReviewInbox()`
+- [ ] **3.4** Keyboard shortcuts for reviewers: `j/k` navigate, `a` approve, `r` reject, `c` comment, `e` escalate
+
+**Exit:** reviewer completes 20 real reviews end-to-end with <30s median decision time on low-risk items.
+
+---
+
+### Phase 4 — Cowork + First Product Migration (week 3)
+
+- [ ] **4.1** Cowork integration:
+  - New flag `remote_approval_enabled` on cowork-service
+  - When true, `CoworkPermissionPrompter` calls `platform-service POST /api/reviews` instead of emitting local Tauri event
+  - Polls / subscribes for decision; resumes tool call on approval
+- [ ] **4.2** Rust-side adapter in `crates/cowork-orchestrator/src/permissions.rs` with HTTP fallback when platform unreachable → local prompt
+- [ ] **4.3** Migrate one product backend (recommend JarvisJr) off custom approval flow:
+  - Identify all `needsApproval`/`pendingConfirm` sites
+  - Replace with `platformClient.reviews.create()`
+  - Delete product-local approval tables after 30-day sunset
+- [ ] **4.4** Document migration playbook for remaining products
+
+**Exit:** JarvisJr custom approval code deleted; all approvals in admin queue.
+
+---
+
+## 7. Tech Stack Decision
+
+| Option                                                                   | V1?    | Rationale                                      |
+| ------------------------------------------------------------------------ | ------ | ---------------------------------------------- |
+| **Platform module + Durable Event Bus + existing notification channels** | ✅ Yes | Aligned with repo; zero new infra              |
+| External ticketing (Jira / Linear)                                       | No     | Control-plane fit poor; audit trail fragmented |
+| Dedicated BPM (Camunda, Flowable)                                        | No     | Overkill; heavyweight                          |
+| Policy engine (OpenFGA / Cedar)                                          | Later  | Add once policy-as-code becomes a need         |
+
+---
+
+## 8. Risks & Mitigations
+
+| Risk                                   | Mitigation                                                                                                  |
+| -------------------------------------- | ----------------------------------------------------------------------------------------------------------- |
+| Review queue becomes dead-letter inbox | Mandatory SLA; auto-expire + escalate; dashboard surfaces queue age                                         |
+| Inconsistent policy across products    | Single `reviews/policies.ts` module; per-product overrides validated in CI                                  |
+| Reviewer fatigue                       | Risk-based routing: low-risk items batch-decide; ship bulk-claim UI on day 1                                |
+| Unaudited decisions                    | Every decision doc encrypted via `@bytelyst/field-encrypt` for `comment` field; reviewer identity immutable |
+| Cowork offline → blocks the agent      | HTTP timeout 5s → fall back to local `CoworkPermissionPrompter`; decision reconciled on next reconnect      |
+
+---
+
+## 9. Acceptance Checklist
+
+- [ ] Phase 1 — generic review object + REST complete
+- [ ] Phase 2 — SLA engine + agent-run integration live
+- [ ] Phase 3 — admin UI + notifications across Slack/email/push
+- [ ] Phase 4 — cowork remote-approval flag + 1 product migration shipped
+- [ ] 30-day sunset plan documented for remaining product-local approval code
+- [ ] `docs/ECOSYSTEM_ARCHITECTURE.md` + AGENTS.md updated
+- [ ] Roadmap moved to `docs/roadmaps/completed/`
--- a/docs/roadmaps/scaffolded/platform_KNOWLEDGE_RAG_SERVICE_ROADMAP_apr17.md
+++ b/docs/roadmaps/scaffolded/platform_KNOWLEDGE_RAG_SERVICE_ROADMAP_apr17.md
@ -0,0 +1,164 @@
+# Knowledge & RAG Service — Detailed Roadmap (2026-04-17)
+
+> **Supersedes:** `platform_KNOWLEDGE_RAG_SERVICE_ROADMAP.md` (scaffold).
+> **Status:** Planned — highest user-visible leverage once Event Bus lands.
+> **Primary surfaces:** `services/platform-service/src/modules/knowledge/`, `services/extraction-service/`, `packages/palace/`, new `packages/rag-client/`.
+> **Estimated effort:** 4–6 weeks (1 engineer) / 3 weeks (2 engineers).
+> **Priority:** **P1** — converges cowork institutional knowledge + MindLyst MemPalace + NoteLett palace + JarvisJr context.
+
+---
+
+## 1. Current State (as of 2026-04-17)
+
+| Surface                              | State                                                                                                  | Evidence                                                                       |
+| ------------------------------------ | ------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
+| `platform-service/modules/knowledge` | Basic CRUD over knowledge docs — no chunking, no embeddings, no retrieval API                          | `services/platform-service/src/modules/knowledge/{routes,repository,types}.ts` |
+| `@bytelyst/palace`                   | MindLyst-style memory package with embeddings scaffolding (19 files)                                   | `packages/palace/`                                                             |
+| `extraction-service`                 | PDF/xlsx/docx text extraction via Python sidecar — no embedding step                                   | `services/extraction-service/`                                                 |
+| `packages/llm-router`                | Deterministic router (provider/model selection); no embeddings client yet                              | `packages/llm-router/`                                                         |
+| Cowork `InstitutionalKnowledge`      | Rust-side per-project `.md`/`.txt` loader (≤100KB scan)                                                | `crates/cowork-orchestrator/src/admin.rs`                                      |
+| Products using RAG                   | MindLyst (custom Palace), NoteLett (custom Palace), LysnrAI (none), JarvisJr (none), ChronoMind (none) | grep of product backends                                                       |
+
+**Net:** every product wants RAG; nothing shared; extraction exists but stops at "text + metadata". No citation surface, no ACL.
+
+---
+
+## 2. Success Criteria (V1)
+
+1. One `POST /api/knowledge/documents` call ingests a PDF → extraction → chunking → embedding → indexed and queryable within 60s P95.
+2. `POST /api/knowledge/search` returns chunks with citations (`documentId`, `pageNumber`, `quoteStart`, `quoteEnd`, `score`) scoped by `productId` + `workspaceId`.
+3. MindLyst + NoteLett both swap their custom palace indexers for `@bytelyst/rag-client` without UX regression.
+4. Cowork's `InstitutionalKnowledge` can optionally call `@bytelyst/rag-client` instead of local file scan when the flag `knowledge_rag_enabled` is on.
+5. Tenant isolation verified by integration test: product A query cannot retrieve product B docs even when bucket names overlap.
+
+---
+
+## 3. Non-Goals (V1)
+
+- Multi-hop retrieval / graph-RAG
+- Hybrid reranking with a dedicated cross-encoder model
+- User-facing document editors / rich collaboration
+- Non-text modalities (image/audio embeddings) — deferred to V2
+
+---
+
+## 4. Dependencies
+
+- **Hard:** Durable Event Bus (C1) for ingestion pipeline reliability
+- **Soft:** Org & Workspace RBAC (existing `@bytelyst/org-client` is sufficient for V1; deeper ACL comes with SCIM roadmap)
+
+---
+
+## 5. Data Model
+
+New Cosmos containers (partition key in **bold**):
+
+| Container               | PK             | Purpose                                                                        |
+| ----------------------- | -------------- | ------------------------------------------------------------------------------ |
+| `knowledge-documents`   | **productId**  | Doc metadata (title, source URI, mime, sha256, workspaceId, createdBy, status) |
+| `knowledge-chunks`      | **documentId** | Text + embedding ref + page/line spans + tokens                                |
+| `knowledge-embeddings`  | **documentId** | Vector payload; separate from chunks to support vector-store migration         |
+| `knowledge-ingest-jobs` | **productId**  | Job lifecycle (queued/extracting/embedding/indexed/failed)                     |
+
+Every doc carries `productId` (house rule) + `workspaceId` + `sourceHash` for dedup.
+
+---
+
+## 6. Phase Plan
+
+### Phase 1 — Document Ingestion Pipeline (week 1–2)
+
+- [ ] **1.1** Extend `knowledge/types.ts` with `KnowledgeDocumentDoc`, `KnowledgeChunkDoc`, `KnowledgeIngestJobDoc`
+- [ ] **1.2** `POST /api/knowledge/documents` creates document row → emits `knowledge.ingest.requested` event
+- [ ] **1.3** Worker (via Durable Event Bus): pulls event → fetches source (blob URL or inline bytes) → calls `extraction-service`
+- [ ] **1.4** Extraction worker returns normalized Markdown + per-chunk anchors (page, line range)
+- [ ] **1.5** Chunker (new `packages/rag-chunker`):
+  - sliding-window 800 tokens, 120 overlap (configurable)
+  - preserves heading hierarchy for "section" metadata
+  - 25+ unit tests: markdown tables, code fences, multi-page PDFs, RTL text
+- [ ] **1.6** Dedup: if `sourceHash` exists for `productId`+`workspaceId`, skip re-embedding; update references only
+
+**Exit:** Upload `AGENTS.md` → within 60s, 10–30 chunks stored with anchors.
+
+---
+
+### Phase 2 — Embedding + Retrieval (week 2–3)
+
+- [ ] **2.1** New `@bytelyst/llm-router` subpath: `llmRouter.embedBatch({ texts, model })` — default `text-embedding-3-small`
+- [ ] **2.2** Embedding worker consumes `knowledge.chunks.ready` events
+  - Batching: up to 256 chunks / request
+  - Retry + DLQ via Durable Event Bus
+  - Cost accounting wired to `ai-budgets` module
+- [ ] **2.3** Vector store — **primary** pgvector via new `packages/vector-store/` adapter
+  - `docker-compose.yml` adds Postgres 16 + pgvector extension
+  - AKV binding: `AZURE_POSTGRES_CONNECTION_STRING`
+  - Fallback: Cosmos native vector search (when Azure emulator supports it) as secondary adapter
+- [ ] **2.4** `POST /api/knowledge/search { query, topK, filters: { workspaceId, tags, dateRange } }`:
+  - Embeds query (cache by sha256 of normalized query, 5-min TTL)
+  - pgvector k-NN with metadata filter pushdown
+  - Returns `{ chunkId, documentId, score, text, citation: { page, lineStart, lineEnd } }`
+- [ ] **2.5** `POST /api/knowledge/assemble { query, maxTokens, productId }` — composes a context block under `maxTokens` with inline citations (`[^1]` syntax)
+
+**Exit:** integration test: upload a 50-page PDF → query → answer includes citations pointing to real page numbers.
+
+---
+
+### Phase 3 — Client SDK + Access Control (week 4)
+
+- [ ] **3.1** New package `@bytelyst/rag-client` (mirrors existing client package patterns — see `subscription-client`, `billing-client`)
+  - `createRagClient({ baseUrl, token, productId })`
+  - Methods: `uploadDocument`, `search`, `assemble`, `getDocument`, `deleteDocument`, `listDocuments`
+  - TypeScript types generated from platform zod schemas
+- [ ] **3.2** RBAC: every request checks `workspaceId` membership via `@bytelyst/org-client`
+- [ ] **3.3** Tenant isolation test suite (`knowledge-acl.test.ts`) — 10+ cases including cross-product attack scenarios
+- [ ] **3.4** Field-encrypt sensitive docs at rest (via `@bytelyst/field-encrypt`) — encrypt `text` column; chunk vectors remain plain (they're derived and low-info)
+
+**Exit:** NoteLett integrates `@bytelyst/rag-client` in a branch; local E2E still passes.
+
+---
+
+### Phase 4 — Connectors + Operator UX (week 5–6)
+
+- [ ] **4.1** Connector: file upload via blob (reuse `@bytelyst/blob`)
+- [ ] **4.2** Connector: URL crawl (single-page) using existing Python sidecar
+- [ ] **4.3** Connector: Gitea repo ingest (`.md` + `AGENTS.md` across a list of repos) — feeds cowork's knowledge loop
+- [ ] **4.4** Admin UI in `dashboards/admin-web/app/ops/knowledge/` — list, reindex, delete, see ingestion DLQ
+- [ ] **4.5** Metrics: `knowledge_ingest_latency_seconds`, `knowledge_search_latency_ms`, `knowledge_embedding_tokens_total`, surfaced in Grafana
+
+**Exit:** 20 real `AGENTS.md` files across the workspace ingested; cowork query returns citations.
+
+---
+
+## 7. Tech Stack Decision
+
+| Vector Backend              | V1?               | Rationale                                                   |
+| --------------------------- | ----------------- | ----------------------------------------------------------- |
+| **pgvector on Postgres 16** | ✅ Yes            | Unified metadata+vector, SQL filtering, ACID for ACL, cheap |
+| Qdrant                      | Deferred          | Better pure-vector perf; adds an extra system               |
+| Cosmos vector search        | Secondary adapter | Useful in Azure-only envs; still evolving                   |
+| Azure AI Search             | No                | Vendor lock-in; less control                                |
+
+**Default:** pgvector primary, Cosmos secondary. Adapter interface allows swap without touching callers.
+
+---
+
+## 8. Risks & Mitigations
+
+| Risk                                | Mitigation                                                                        |
+| ----------------------------------- | --------------------------------------------------------------------------------- |
+| Retrieval leaks cross-tenant data   | ACL filter enforced in adapter layer, not app layer; adversarial test suite       |
+| Embedding cost balloons             | `ai-budgets` enforcement at dispatch + dedup by sourceHash + query cache          |
+| Stale embeddings after model change | `embeddingModel` recorded per chunk; reindex job triggered by model registry bump |
+| Product teams bypass the service    | Deprecation path: mark palace methods `@deprecated`, CI warning, 90-day sunset    |
+
+---
+
+## 9. Acceptance Checklist
+
+- [ ] Phase 1 — ingestion pipeline live; DLQ visible in admin
+- [ ] Phase 2 — pgvector in compose; search returns citations
+- [ ] Phase 3 — `@bytelyst/rag-client` published; NoteLett migration branch green
+- [ ] Phase 4 — admin UI + Gitea connector + Grafana dashboard
+- [ ] Cowork `InstitutionalKnowledge` opt-in migration path documented
+- [ ] `docs/ECOSYSTEM_ARCHITECTURE.md` updated with the RAG substrate
+- [ ] Roadmap moved to `docs/roadmaps/completed/`