From 46a16f06bc9f602ab7466afe8f45ea2eef695154 Mon Sep 17 00:00:00 2001 From: saravanakumardb1 Date: Fri, 17 Apr 2026 09:36:44 -0700 Subject: [PATCH] docs(roadmaps): detailed 2026-04-17 plans for event bus, RAG, agent runtime, approval queue Four grounded roadmaps superseding the scaffolded versions with current-state audits, data models, week-by-week phase plans, tech-stack decisions, and acceptance checklists. Execution order: Event Bus (P0) unlocks Agent Runtime + RAG (P1 parallel), which unlock Approval Queue (P1). --- ...ENT_RUNTIME_ORCHESTRATION_ROADMAP_apr17.md | 183 ++++++++++++++++++ ...NT_BUS_AND_WORKER_RUNTIME_ROADMAP_apr17.md | 143 ++++++++++++++ ...MAN_REVIEW_APPROVAL_QUEUE_ROADMAP_apr17.md | 173 +++++++++++++++++ ...orm_KNOWLEDGE_RAG_SERVICE_ROADMAP_apr17.md | 164 ++++++++++++++++ 4 files changed, 663 insertions(+) create mode 100644 docs/roadmaps/scaffolded/platform_AGENT_RUNTIME_ORCHESTRATION_ROADMAP_apr17.md create mode 100644 docs/roadmaps/scaffolded/platform_DURABLE_EVENT_BUS_AND_WORKER_RUNTIME_ROADMAP_apr17.md create mode 100644 docs/roadmaps/scaffolded/platform_HUMAN_REVIEW_APPROVAL_QUEUE_ROADMAP_apr17.md create mode 100644 docs/roadmaps/scaffolded/platform_KNOWLEDGE_RAG_SERVICE_ROADMAP_apr17.md diff --git a/docs/roadmaps/scaffolded/platform_AGENT_RUNTIME_ORCHESTRATION_ROADMAP_apr17.md b/docs/roadmaps/scaffolded/platform_AGENT_RUNTIME_ORCHESTRATION_ROADMAP_apr17.md new file mode 100644 index 00000000..50374de5 --- /dev/null +++ b/docs/roadmaps/scaffolded/platform_AGENT_RUNTIME_ORCHESTRATION_ROADMAP_apr17.md @@ -0,0 +1,183 @@ +# Agent Runtime & Orchestration — Detailed Roadmap (2026-04-17) + +> **Supersedes:** `platform_AGENT_RUNTIME_ORCHESTRATION_ROADMAP.md` (scaffold). +> **Status:** Planned — enables platform-hosted multi-step agent runs reusable across products. +> **Primary surfaces:** `services/platform-service/src/modules/{agent-runtime,agents,runs}/`, `services/mcp-server/src/modules/a2a/`, `services/cowork-service/`. +> **Estimated effort:** 3–5 weeks (1 engineer) / 2–3 weeks (2 engineers). +> **Priority:** **P1** — closes the gap between cowork (desktop agent) and platform-hosted agents invoked by product backends or admins. + +--- + +## 1. Current State (as of 2026-04-17) + +| Surface | State | Evidence | +| ---------------------------------------- | -------------------------------------------------------------------------------------------------- | ---------------------------------------------------- | +| `platform-service/modules/agent-runtime` | Routes only; no persistent state model | `routes.ts`, `routes.test.ts` | +| `platform-service/modules/agents` | Fuller: `repository`, `routes`, `executor`, `tool-registry`, `types` — but in-memory execution | 8 files in `agents/` | +| `platform-service/modules/runs` | Run tracker (`tracker.ts`), repository, routes — already models run lifecycle but not queue-backed | 7 files | +| `platform-service/modules/ai-budgets` | Exists — enforces spend per product | `ai-budgets/` | +| `platform-service/modules/agent-evals` | Exists — eval harness skeleton | `agent-evals/` | +| `mcp-server/modules/a2a` | A2A pipelines as composed code paths; no durable step state | `services/mcp-server/src/modules/a2a/` | +| `cowork-service/modules/agent-runtime` | IPC bridge to cowork-orchestrator `submit_task` | `services/cowork-service/src/modules/agent-runtime/` | +| `@bytelyst/queue` | `QueueWorker` + memory/file stores. No cross-process yet | `packages/queue/` | + +**Net:** the run data model is partially there; what's missing is the **durable queue-backed execution** plus **unified orchestration** so cowork, product-backend agents, and admin-initiated runs share one run API. + +--- + +## 2. Success Criteria (V1) + +1. An `POST /api/agent-runs` request creates a queued run; executor worker picks it up and progresses through steps with full persistence. +2. Kill the executor mid-step → run resumes from the last completed step on restart without duplicated side effects (idempotency on step outputs). +3. A run can enter `waiting_for_review` (see Approval Queue roadmap) and resume when a reviewer decides. +4. Cowork sessions created via `ipc-bridge` register as platform agent runs and appear in `/ops/runs`. +5. Each run carries `aiBudgetId` and writes per-step token usage; over-budget runs pause, not crash. +6. 300+ concurrent runs sustained with per-run latency < 2× single-run baseline. + +--- + +## 3. Non-Goals (V1) + +- Full Temporal adoption (contract is Temporal-compatible for future migration; no Temporal infra yet) +- User-authored workflow DSL +- Multi-region run migration +- Streaming step outputs to clients (add post-V1 via SSE + existing `@bytelyst/fastify-sse`) + +--- + +## 4. Dependencies + +- **Hard:** Durable Event Bus & Worker Runtime (C1) +- **Hard:** `@bytelyst/queue` Redis adapter (Phase 2 of C1) +- **Soft:** Human Review Approval Queue (C4) for `waiting_for_review` semantics +- **Soft:** Agent Registry & Prompt Versioning (existing scaffolded doc) — for deterministic `agentVersion` binding + +--- + +## 5. Canonical Run Model + +New/expanded Cosmos containers (PK in **bold**): + +| Container | PK | Purpose | +| ------------------ | ------------- | ---------------------------------------------------------------------------------------------------------- | +| `agent-runs` | **productId** | `AgentRunDoc`: state, workflowId, agentId, agentVersion, triggerSource, parentRunId, aiBudgetId, createdBy | +| `agent-run-steps` | **runId** | Step inputs/outputs/timings/errors; idempotency key; `state` | +| `agent-run-events` | **runId** | Durable event log (mirror into event store) for replay/operator view | +| `agent-run-leases` | **runId** | Worker lease with expiry — enables safe failover | + +Run states: `queued → running → waiting_for_input | waiting_for_review | paused → running → (succeeded | failed | cancelled)`. + +Step states: `pending → running → (succeeded | failed | retrying | skipped)`. + +Every doc includes `productId` + `workspaceId` + `correlationId`. + +--- + +## 6. Phase Plan + +### Phase 1 — Durable Run Model (week 1) + +- [ ] **1.1** Extend `agent-runtime/types.ts` with `AgentRunDoc`, `AgentRunStepDoc`, `AgentRunEventDoc`, `AgentRunLeaseDoc` (zod schemas) +- [ ] **1.2** Repository methods: `createRun`, `getRun`, `listRuns`, `appendStep`, `updateStepState`, `acquireLease`, `renewLease`, `releaseLease` +- [ ] **1.3** Migrate existing `runs/` module → rename to legacy alias; new agent-runs module becomes source of truth +- [ ] **1.4** REST surface: + - [ ] `POST /api/agent-runs` → creates run, publishes `agent.run.execute` event + - [ ] `GET /api/agent-runs/:id` + - [ ] `GET /api/agent-runs/:id/events` (paginated + optional SSE) + - [ ] `POST /api/agent-runs/:id/cancel` → sets `cancelRequested` flag read by executor + - [ ] `POST /api/agent-runs/:id/pause` / `/resume` + - [ ] `POST /api/agent-runs/:id/signal` — named signal (for approval, input, tool-result) +- [ ] **1.5** 25+ route tests using `@bytelyst/testing` fastify inject helpers + +**Exit:** run lifecycle transitions persist across fastify restart; all state queryable. + +--- + +### Phase 2 — Queue-Backed Executor (week 2) + +- [ ] **2.1** Add `agent.run.execute` queue type on top of Durable Event Bus + `@bytelyst/queue` +- [ ] **2.2** `AgentRunExecutor` class in `modules/agent-runtime/executor.ts`: + - Acquires lease (30s heartbeat); holds until step completes + - For each step: resolves agent spec → calls tool (`agents/tool-registry.ts`) → records output with idempotency key + - On failure: exponential backoff, max retries per step from `agentVersion.retryPolicy` + - DLQ on exhausted retries; run → `failed` +- [ ] **2.3** Idempotency: + - Caller can pass `idempotencyKey` on `POST /agent-runs`; repeat within 24h returns existing run + - Each step output stored by `{runId, stepIndex, attempt}`; external tool calls opt-in to idempotency via handler contract +- [ ] **2.4** Budget integration: pre-step check against `ai-budgets`; over-budget → run → `waiting_for_input` with reason `budget_exceeded` +- [ ] **2.5** Cancellation: `cancelRequested` polled between steps; in-flight step finishes cleanly, run → `cancelled` + +**Exit:** chaos test — kill executor process during a 5-step run. Run resumes from last checkpoint. Idempotent tool not double-invoked. + +--- + +### Phase 3 — A2A + Cowork Integration (week 3) + +- [ ] **3.1** `mcp-server/modules/a2a` refactor: pipeline steps become agent-run steps. Pipeline definition → stored as workflow template, not code path. +- [ ] **3.2** `cowork-service/modules/agent-runtime` updates: when cowork submits a task via IPC, also register as platform agent run (dual-write with `triggerSource=cowork`) +- [ ] **3.3** Cross-system correlation: cowork's `taskId` stored as external ref on agent-run; admin `/ops/runs` surfaces both sides +- [ ] **3.4** Agent Registry integration (depends on separately scaffolded roadmap): `agentId + agentVersion` required on run creation; registry prompt snapshot frozen at run start + +**Exit:** every cowork session visible under `/ops/runs`; every A2A pipeline surfaces step-by-step. + +--- + +### Phase 4 — Operator Experience (week 4) + +- [ ] **4.1** Admin UI `dashboards/admin-web/app/ops/runs/`: + - List with filters (product, workspace, agent, state, age, reviewer) + - Timeline view per run + - Step diff view (prompt/tool-call/output) + - Cancel / retry / replay controls +- [ ] **4.2** SLO dashboard: success rate, mean duration, retry rate, DLQ count per agent +- [ ] **4.3** `@bytelyst/platform-client` additions: `client.agentRuns.create/get/cancel/signal` for product backends +- [ ] **4.4** Migration guide: product backends that roll their own agent loops → 1-call swap + +**Exit:** one product backend (e.g., JarvisJr) dispatches an agent via platform runs instead of in-proc call. + +--- + +### Phase 5 — Scale & Hardening (week 5) + +- [ ] **5.1** Horizontal scale: 2+ executor replicas via lease-based safety +- [ ] **5.2** Load test: 300 concurrent runs, 5 steps each, measure P50/P95/P99 +- [ ] **5.3** Backpressure: when queue depth > N, new runs enter `queued` but `POST` returns immediately; admin can see queue saturation +- [ ] **5.4** Circuit breakers: per-agent failure-rate trips new run acceptance for 60s +- [ ] **5.5** Observability: OpenTelemetry spans around every step; trace ID propagates to external tool calls + +**Exit:** load test meets latency budget; operator dashboard usable under load. + +--- + +## 7. Tech Stack Decision + +| Option | V1? | Rationale | +| ------------------------------------------------------- | ----------------------------- | ------------------------------------------------- | +| **`@bytelyst/queue` + Redis adapter + Cosmos run docs** | ✅ Yes | Reuses Durable Event Bus; no new major infra | +| Temporal | Deferred | Right answer for V2; too much infra for V1 | +| BullMQ direct | Replaced by `@bytelyst/queue` | Same class; we'd just be skipping our abstraction | +| SQS / EventBridge | Skip | Cloud-vendor coupling | + +--- + +## 8. Risks & Mitigations + +| Risk | Mitigation | +| --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- | +| "Weak in-house Temporal" drift | Contract borrows Temporal terminology (workflowId, runId, signals, child runs) so future migration is mechanical | +| Step idempotency misuse | Executor requires handlers declare `idempotent: true`; non-idempotent defaults to single-attempt + failover-safe retry only on transient errors | +| Lease contention at scale | Leases include random jitter; stuck runs auto-released after 3× heartbeat period | +| Product teams bypass to run local loops | Deprecation + budget-module enforcement: unregistered runs get 0 budget | + +--- + +## 9. Acceptance Checklist + +- [ ] Phase 1 — run model persistent; REST complete +- [ ] Phase 2 — executor survives restart; idempotency proven +- [ ] Phase 3 — A2A + cowork both integrated +- [ ] Phase 4 — admin UI shipped; one product backend migrated +- [ ] Phase 5 — 300-concurrent load test passes +- [ ] `platform-client` published with new methods +- [ ] `docs/ECOSYSTEM_ARCHITECTURE.md` + AGENTS.md updated +- [ ] Roadmap moved to `docs/roadmaps/completed/` diff --git a/docs/roadmaps/scaffolded/platform_DURABLE_EVENT_BUS_AND_WORKER_RUNTIME_ROADMAP_apr17.md b/docs/roadmaps/scaffolded/platform_DURABLE_EVENT_BUS_AND_WORKER_RUNTIME_ROADMAP_apr17.md new file mode 100644 index 00000000..d4d8a3d0 --- /dev/null +++ b/docs/roadmaps/scaffolded/platform_DURABLE_EVENT_BUS_AND_WORKER_RUNTIME_ROADMAP_apr17.md @@ -0,0 +1,143 @@ +# Durable Event Bus & Worker Runtime — Detailed Roadmap (2026-04-17) + +> **Supersedes:** `platform_DURABLE_EVENT_BUS_AND_WORKER_RUNTIME_ROADMAP.md` (scaffold). +> **Status:** Planned — unblocks Agent Runtime, Approval Queue, Knowledge ingestion. +> **Primary surfaces:** `packages/events/`, `packages/queue/`, `packages/event-store/`, `services/platform-service/`. +> **Estimated effort:** 3–4 weeks (1 engineer) / 2 weeks (2 engineers). +> **Priority:** **P0** — foundation for Categories C3/C4 and for reliable cross-service side effects. + +--- + +## 1. Current State (as of 2026-04-17) + +| Surface | State | Evidence | +| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------ | +| `@bytelyst/events` | `EventBus` (memory) + `DurableEventBus` interface + typed schemas (AgentRun, AgentTask, AgentDispatch, AgentCheckpoint, Artifact, …) | `packages/events/src/index.ts` | +| `@bytelyst/event-store` | Cosmos-backed append-only event store (exists, underused) | `packages/event-store/src/` | +| `@bytelyst/queue` | `QueueWorker` + `MemoryQueueStore` + `FileQueueStore`. No Redis, no cross-process delivery | `packages/queue/src/index.ts` | +| Platform subscribers | `delivery`, `notifications`, `audit`, `runs/tracker.ts` emit on the in-proc memory bus; cross-service = ad-hoc HTTP | `services/platform-service/src/modules/*/` | +| Worker processes | None standardized. Scheduled jobs embed inside fastify via setInterval / per-module timers | n/a | + +**Net:** we have durable-queue and durable-event-store primitives but no **standardized durable bus or worker runtime** consuming them. Production flows still ride the in-memory bus. + +--- + +## 2. Success Criteria (V1) + +1. One event emitted in `platform-service` is reliably consumed by `cowork-service`, `extraction-service`, or `mcp-server` within 1s P95, surviving restart on either side. +2. Every durable subscriber records ack / retry / DLQ counts; operators can query/replay from Cosmos. +3. A single `workers/` bootstrap pattern is documented and used by ≥3 side-effect consumers. +4. No new module added after Phase 3 relies on the in-memory bus for production-critical flows. +5. `@bytelyst/events` memory adapter remains the default in **test environments only** (`NODE_ENV=test`). + +--- + +## 3. Non-Goals (V1) + +- Kafka-scale throughput, exactly-once delivery, cross-region replication +- Full Temporal semantics (that's the Agent Runtime roadmap) +- Replacing the HTTP REST layer between services + +--- + +## 4. Dependencies + +- None (this is the foundation). Enables: Agent Runtime Orchestration, Human Review Approval Queue, Knowledge RAG ingestion. + +--- + +## 5. Phase Plan + +### Phase 1 — Pluggable Event Bus Backends (week 1) + +- [ ] **1.1** Formalize `DurableEventAdapter` contract in `@bytelyst/events` (publish, subscribe, ack, nack, dlq, replay) +- [ ] **1.2** Keep `MemoryAdapter` as the default for tests +- [ ] **1.3** Implement `CosmosEventStoreAdapter` on top of existing `@bytelyst/event-store` (append + tail via change-feed) + - [ ] Partition key: `productId` + day-bucket + - [ ] Consumer groups via lease-collection (reuse `packages/cosmos/src/lease-manager` if present; else add lightweight doc-based lease) +- [ ] **1.4** Add `correlationId` + `causationId` to every `PlatformEventSchemas` entry (schema-breaking; ship behind zod `.catchall()` fallback for unknown fields) +- [ ] **1.5** Add `packages/events/src/durable.test.ts` — adapter conformance suite (publish/subscribe/ack/retry/DLQ) + +**Exit:** `pnpm --filter @bytelyst/events test` covers the Cosmos adapter using the Cosmos emulator. + +--- + +### Phase 2 — Standardized Worker Runtime (week 2) + +- [ ] **2.1** New surface: `services/platform-service/src/workers/` with a single `createWorker(opts)` bootstrap + - [ ] Concurrency control (default: `AGENT_WORKER_CONCURRENCY=4`) + - [ ] Lease heartbeat (default 30s, configurable per handler) + - [ ] Graceful shutdown on SIGINT/SIGTERM + - [ ] Pino child logger with `workerName`, `eventId`, `correlationId` bindings +- [ ] **2.2** Expose `/ops/workers/*` admin routes: list workers, inspect in-flight leases, force-replay DLQ item, pause/resume +- [ ] **2.3** Health: each worker reports into `GET /api/health/dependencies` under a `workers` key +- [ ] **2.4** Operator command: `pnpm --filter @lysnrai/platform-service worker:run ` for running a single worker out-of-process (k8s-ready) +- [ ] **2.5** Dead-letter inspection: Cosmos container `events-dlq` with 30-day TTL; admin UI can re-enqueue + +**Exit:** `reviews` module's notification side-effects migrated to durable worker. Kill the Fastify process mid-emit; event still delivered after restart. + +--- + +### Phase 3 — Migrate Production Side Effects (week 3) + +- [ ] **3.1** Audit in-memory `EventBus.emit()` call sites (expect ≤20) and categorize: + - Critical (must durably deliver): auth-events, run-state-changes, billing-events, approval-events + - Best-effort (OK to lose): dev-time dashboards, local UI refresh hints +- [ ] **3.2** Migrate critical callers to `DurableEventBus` one module at a time: + - [ ] `auth` side-effects (login audit, MFA trigger) + - [ ] `runs/tracker.ts` state transitions + - [ ] `billing-checkout` / `dunning` / `subscriptions` webhooks + - [ ] `notifications` fan-out + - [ ] `delivery` outbox +- [ ] **3.3** Replace per-module `setInterval` schedulers with registered durable workers (e.g., `retention/`, `exports/`, `predictive-analytics/`) +- [ ] **3.4** Telemetry: emit `event_published_total`, `event_consumed_total`, `event_dlq_total`, `worker_lease_expired_total` via `@bytelyst/backend-telemetry` +- [ ] **3.5** Grafana dashboard: event-lag P95 per subscriber, DLQ size, worker concurrency saturation + +**Exit:** post-migration, no production code path emits to `new EventBus()` in `NODE_ENV=production`. Enforced by an ESLint rule in `eslint.config.js` (`no-restricted-imports` scoped to `@bytelyst/events/memory`). + +--- + +### Phase 4 — Cross-Service Subscriptions (week 3–4) + +- [ ] **4.1** `cowork-service` subscribes to `agent.run.*` events and mirrors state into its own tracker (currently polls `platform-service/runs`) +- [ ] **4.2** `extraction-service` subscribes to `knowledge.ingest.requested` (see RAG roadmap) instead of exposing REST +- [ ] **4.3** `mcp-server` emits `a2a.pipeline.step.*` events that become the audit trail source of truth +- [ ] **4.4** Compose: add a Redis container to `docker-compose.yml` as an **optional** fast-path adapter (later phase) + +**Exit:** `pnpm prototype:self-test` exercises one end-to-end cross-service event. + +--- + +## 6. Tech Stack Decision + +| Option | V1 default? | Rationale | +| ------------------------------ | ------------------------ | ------------------------------------------------------------------------ | +| **Cosmos change-feed adapter** | ✅ Yes | Reuses existing emulator, persistence story, Azure runtime; no new infra | +| Redis Streams | Later (Phase 4 optional) | Lower latency; add after Cosmos proves semantics | +| NATS JetStream | Deferred | Best long-term; reassess at Phase 5+ | +| Kafka | Never (overkill) | — | + +**Default decision:** Cosmos-backed adapter first → add Redis optional adapter at Phase 4 → revisit NATS once event volume > 10 events/sec sustained. + +--- + +## 7. Risks & Mitigations + +| Risk | Mitigation | +| ---------------------------------- | ---------------------------------------------------------------------------------------------------------- | +| Cosmos change-feed lag under burst | Per-consumer lease + configurable `maxItemCount`; alert on lag > 5s | +| Schema drift breaks replay | Every event carries `schemaVersion`; adapters refuse unknown major versions; replay path calls migrator | +| Workers multiply process count | `services/platform-service` runs workers in-process by default; only split out when concurrency demands it | +| In-memory bus regressions | Lint rule + CI grep blocks `new EventBus()` outside `*/test/*` paths | + +--- + +## 8. Acceptance Checklist + +- [ ] Phase 1 complete — adapter conformance suite passes on memory + Cosmos +- [ ] Phase 2 complete — `/ops/workers` + DLQ admin routes shipped +- [ ] Phase 3 complete — ≥5 modules migrated; ESLint guard active +- [ ] Phase 4 complete — one cross-service flow demonstrated end-to-end +- [ ] Grafana dashboard `platform-events` live +- [ ] AGENTS.md + `docs/ECOSYSTEM_ARCHITECTURE.md` updated +- [ ] Roadmap moved to `docs/roadmaps/completed/` diff --git a/docs/roadmaps/scaffolded/platform_HUMAN_REVIEW_APPROVAL_QUEUE_ROADMAP_apr17.md b/docs/roadmaps/scaffolded/platform_HUMAN_REVIEW_APPROVAL_QUEUE_ROADMAP_apr17.md new file mode 100644 index 00000000..ddc905e0 --- /dev/null +++ b/docs/roadmaps/scaffolded/platform_HUMAN_REVIEW_APPROVAL_QUEUE_ROADMAP_apr17.md @@ -0,0 +1,173 @@ +# Human Review & Approval Queue — Detailed Roadmap (2026-04-17) + +> **Supersedes:** `platform_HUMAN_REVIEW_APPROVAL_QUEUE_ROADMAP.md` (scaffold). +> **Status:** Planned — critical safety rail for autonomous agents. +> **Primary surfaces:** `services/platform-service/src/modules/reviews/`, new `modules/approvals/`, new `modules/escalations/`, `dashboards/admin-web/`. +> **Estimated effort:** 2–3 weeks (1 engineer). +> **Priority:** **P1** — gates every write-capable agent action to avoid blast radius incidents. + +--- + +## 1. Current State (as of 2026-04-17) + +| Surface | State | Evidence | +| ---------------------------------- | ----------------------------------------------------------------------------------------------------- | ----------------------------------------------- | +| `platform-service/modules/reviews` | Module exists: `notifications.ts`, `repository`, `routes`, `types` — partial implementation | 6 files in `reviews/` | +| MFA push-approval | Narrow auth flow only | `auth` module | +| Cowork destructive-action guard | `CoworkPermissionPrompter` emits `approval-required` Tauri event; resolves in-process | `crates/cowork-orchestrator/src/permissions.rs` | +| A2A pipeline pause-for-review | Not implemented | — | +| Product backend approval flows | Ad-hoc per product (JarvisJr, NomGap have custom) | grep confirms duplication | +| Notifications | `notifications` module + delivery channels (Slack stub, email via Mailpit, push via `@bytelyst/push`) | `notifications/` | + +**Net:** `reviews` module has skeleton but isn't the **generic human-in-the-loop system**. Every approval currently shipped is bespoke. + +--- + +## 2. Success Criteria (V1) + +1. Any agent run can emit `waiting_for_review` with a `reviewReason` and a decision contract; a reviewer's decision resumes or cancels the run. +2. Cowork destructive actions (file delete, shell `rm`, commit push) route through the platform reviews module when `remote_approval_enabled=true` — otherwise keep local prompter. +3. Reviewers see one inbox across products; can bulk-claim, add comments, and make decisions via admin UI or API. +4. SLA timers: unclaimed reviews escalate at T1, auto-reject at T2 (configurable per risk level). +5. Full audit trail: every decision captured with reviewer identity, timestamp, reason, supporting evidence pointers. +6. One real product backend (JarvisJr or NomGap) removes its custom approval flow in favor of this. + +--- + +## 3. Non-Goals (V1) + +- Full BPM suite or arbitrary approval-chain designer +- End-user case portals (reviewer-only UX in V1) +- Risk-score auto-decisions (deferred to V2 with policy engine) +- Legal-hold / eDiscovery extensions + +--- + +## 4. Dependencies + +- **Hard:** Durable Event Bus (C1) — for review events and notification fan-out +- **Hard:** Agent Runtime Orchestration (C3) — for `waiting_for_review` state integration +- **Soft:** Org & Workspace RBAC scaffolded roadmap — for reviewer assignment by role + +--- + +## 5. Data Model + +New Cosmos containers (PK in **bold**): + +| Container | PK | Purpose | +| -------------------- | ------------- | --------------------------------------------------------------------------------------- | +| `reviews` | **productId** | `ReviewDoc`: subjectType, subjectId, riskLevel, requiredDecision, assignees, state, sla | +| `review-decisions` | **reviewId** | `DecisionDoc`: decision, reviewerId, comment, decidedAt, attachments | +| `review-escalations` | **reviewId** | `EscalationDoc`: level, triggeredAt, reassignedTo, reason | + +Subject types: `agent_run_step`, `external_action`, `content_publication`, `prompt_change`, `low_confidence_output`, `tool_invocation`, `custom` (with `subjectPayload` JSON). + +States: `pending → claimed → (approved | rejected | expired | superseded)`. + +Risk levels: `low | medium | high | critical` → dictates SLA (24h / 8h / 2h / 30m default). + +--- + +## 6. Phase Plan + +### Phase 1 — Generic Review Object (week 1) + +- [ ] **1.1** Extend `reviews/types.ts`: `ReviewDoc`, `DecisionDoc`, `EscalationDoc` zod schemas +- [ ] **1.2** Repository: `createReview`, `getReview`, `listReviews`, `claimReview`, `submitDecision`, `expireReview`, `reassign` +- [ ] **1.3** REST surface: + - [ ] `POST /api/reviews` (service-to-service) + - [ ] `GET /api/reviews/:id` + - [ ] `GET /api/reviews` (filters: product, workspace, state, reviewer, risk, age) + - [ ] `POST /api/reviews/:id/claim` + - [ ] `POST /api/reviews/:id/decide { decision, comment }` + - [ ] `POST /api/reviews/:id/reassign { reviewerIds }` +- [ ] **1.4** Reviewer authz: `canReview(productId, workspaceId, userId, riskLevel)` helper (reuse `@bytelyst/org-client`) +- [ ] **1.5** 20+ route tests covering ACL edges (cross-product, cross-workspace, expired claim) + +**Exit:** create a review via curl → claim → decide → state machine enforced. + +--- + +### Phase 2 — Workflow & Escalation (week 2) + +- [ ] **2.1** SLA engine: background worker (on Durable Event Bus) scans pending reviews; escalates at T1, expires at T2 +- [ ] **2.2** Emit events: `review.created`, `review.claimed`, `review.decided`, `review.escalated`, `review.expired` +- [ ] **2.3** Agent Runtime integration: + - Agent run step can call `requestReview(subject, risk)` → step state `waiting_for_review` → run state `waiting_for_review` + - `review.decided` event signals run; `approved` → continue, `rejected`/`expired` → fail step +- [ ] **2.4** Escalation policies: per-product JSON config in `reviews/policies.ts` (default + product overrides) +- [ ] **2.5** Supersession: if a new review for same subject arrives, the old one auto-`superseded` with audit link + +**Exit:** integration test: agent run requests review → 5s later auto-escalates → reviewer decides → run resumes. + +--- + +### Phase 3 — Reviewer Delivery + UI (week 2–3) + +- [ ] **3.1** Notification fan-out on `review.created` + `review.escalated`: + - Email via Mailpit / production SMTP + - Slack channel webhook (reuse `notifications`) + - Push via `@bytelyst/push` to assigned reviewer's devices + - Telegram (optional adapter) +- [ ] **3.2** Admin UI `dashboards/admin-web/app/ops/reviews/`: + - Queue view with filters + - Bulk-claim + - Decision form with evidence pane (renders `subjectPayload`) + - Timeline per review (events + escalations + decision) +- [ ] **3.3** Reviewer mobile hook: `@bytelyst/react-native-platform-sdk` exposes `useReviewInbox()` +- [ ] **3.4** Keyboard shortcuts for reviewers: `j/k` navigate, `a` approve, `r` reject, `c` comment, `e` escalate + +**Exit:** reviewer completes 20 real reviews end-to-end with <30s median decision time on low-risk items. + +--- + +### Phase 4 — Cowork + First Product Migration (week 3) + +- [ ] **4.1** Cowork integration: + - New flag `remote_approval_enabled` on cowork-service + - When true, `CoworkPermissionPrompter` calls `platform-service POST /api/reviews` instead of emitting local Tauri event + - Polls / subscribes for decision; resumes tool call on approval +- [ ] **4.2** Rust-side adapter in `crates/cowork-orchestrator/src/permissions.rs` with HTTP fallback when platform unreachable → local prompt +- [ ] **4.3** Migrate one product backend (recommend JarvisJr) off custom approval flow: + - Identify all `needsApproval`/`pendingConfirm` sites + - Replace with `platformClient.reviews.create()` + - Delete product-local approval tables after 30-day sunset +- [ ] **4.4** Document migration playbook for remaining products + +**Exit:** JarvisJr custom approval code deleted; all approvals in admin queue. + +--- + +## 7. Tech Stack Decision + +| Option | V1? | Rationale | +| ------------------------------------------------------------------------ | ------ | ---------------------------------------------- | +| **Platform module + Durable Event Bus + existing notification channels** | ✅ Yes | Aligned with repo; zero new infra | +| External ticketing (Jira / Linear) | No | Control-plane fit poor; audit trail fragmented | +| Dedicated BPM (Camunda, Flowable) | No | Overkill; heavyweight | +| Policy engine (OpenFGA / Cedar) | Later | Add once policy-as-code becomes a need | + +--- + +## 8. Risks & Mitigations + +| Risk | Mitigation | +| -------------------------------------- | ----------------------------------------------------------------------------------------------------------- | +| Review queue becomes dead-letter inbox | Mandatory SLA; auto-expire + escalate; dashboard surfaces queue age | +| Inconsistent policy across products | Single `reviews/policies.ts` module; per-product overrides validated in CI | +| Reviewer fatigue | Risk-based routing: low-risk items batch-decide; ship bulk-claim UI on day 1 | +| Unaudited decisions | Every decision doc encrypted via `@bytelyst/field-encrypt` for `comment` field; reviewer identity immutable | +| Cowork offline → blocks the agent | HTTP timeout 5s → fall back to local `CoworkPermissionPrompter`; decision reconciled on next reconnect | + +--- + +## 9. Acceptance Checklist + +- [ ] Phase 1 — generic review object + REST complete +- [ ] Phase 2 — SLA engine + agent-run integration live +- [ ] Phase 3 — admin UI + notifications across Slack/email/push +- [ ] Phase 4 — cowork remote-approval flag + 1 product migration shipped +- [ ] 30-day sunset plan documented for remaining product-local approval code +- [ ] `docs/ECOSYSTEM_ARCHITECTURE.md` + AGENTS.md updated +- [ ] Roadmap moved to `docs/roadmaps/completed/` diff --git a/docs/roadmaps/scaffolded/platform_KNOWLEDGE_RAG_SERVICE_ROADMAP_apr17.md b/docs/roadmaps/scaffolded/platform_KNOWLEDGE_RAG_SERVICE_ROADMAP_apr17.md new file mode 100644 index 00000000..29c167d1 --- /dev/null +++ b/docs/roadmaps/scaffolded/platform_KNOWLEDGE_RAG_SERVICE_ROADMAP_apr17.md @@ -0,0 +1,164 @@ +# Knowledge & RAG Service — Detailed Roadmap (2026-04-17) + +> **Supersedes:** `platform_KNOWLEDGE_RAG_SERVICE_ROADMAP.md` (scaffold). +> **Status:** Planned — highest user-visible leverage once Event Bus lands. +> **Primary surfaces:** `services/platform-service/src/modules/knowledge/`, `services/extraction-service/`, `packages/palace/`, new `packages/rag-client/`. +> **Estimated effort:** 4–6 weeks (1 engineer) / 3 weeks (2 engineers). +> **Priority:** **P1** — converges cowork institutional knowledge + MindLyst MemPalace + NoteLett palace + JarvisJr context. + +--- + +## 1. Current State (as of 2026-04-17) + +| Surface | State | Evidence | +| ------------------------------------ | ------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ | +| `platform-service/modules/knowledge` | Basic CRUD over knowledge docs — no chunking, no embeddings, no retrieval API | `services/platform-service/src/modules/knowledge/{routes,repository,types}.ts` | +| `@bytelyst/palace` | MindLyst-style memory package with embeddings scaffolding (19 files) | `packages/palace/` | +| `extraction-service` | PDF/xlsx/docx text extraction via Python sidecar — no embedding step | `services/extraction-service/` | +| `packages/llm-router` | Deterministic router (provider/model selection); no embeddings client yet | `packages/llm-router/` | +| Cowork `InstitutionalKnowledge` | Rust-side per-project `.md`/`.txt` loader (≤100KB scan) | `crates/cowork-orchestrator/src/admin.rs` | +| Products using RAG | MindLyst (custom Palace), NoteLett (custom Palace), LysnrAI (none), JarvisJr (none), ChronoMind (none) | grep of product backends | + +**Net:** every product wants RAG; nothing shared; extraction exists but stops at "text + metadata". No citation surface, no ACL. + +--- + +## 2. Success Criteria (V1) + +1. One `POST /api/knowledge/documents` call ingests a PDF → extraction → chunking → embedding → indexed and queryable within 60s P95. +2. `POST /api/knowledge/search` returns chunks with citations (`documentId`, `pageNumber`, `quoteStart`, `quoteEnd`, `score`) scoped by `productId` + `workspaceId`. +3. MindLyst + NoteLett both swap their custom palace indexers for `@bytelyst/rag-client` without UX regression. +4. Cowork's `InstitutionalKnowledge` can optionally call `@bytelyst/rag-client` instead of local file scan when the flag `knowledge_rag_enabled` is on. +5. Tenant isolation verified by integration test: product A query cannot retrieve product B docs even when bucket names overlap. + +--- + +## 3. Non-Goals (V1) + +- Multi-hop retrieval / graph-RAG +- Hybrid reranking with a dedicated cross-encoder model +- User-facing document editors / rich collaboration +- Non-text modalities (image/audio embeddings) — deferred to V2 + +--- + +## 4. Dependencies + +- **Hard:** Durable Event Bus (C1) for ingestion pipeline reliability +- **Soft:** Org & Workspace RBAC (existing `@bytelyst/org-client` is sufficient for V1; deeper ACL comes with SCIM roadmap) + +--- + +## 5. Data Model + +New Cosmos containers (partition key in **bold**): + +| Container | PK | Purpose | +| ----------------------- | -------------- | ------------------------------------------------------------------------------ | +| `knowledge-documents` | **productId** | Doc metadata (title, source URI, mime, sha256, workspaceId, createdBy, status) | +| `knowledge-chunks` | **documentId** | Text + embedding ref + page/line spans + tokens | +| `knowledge-embeddings` | **documentId** | Vector payload; separate from chunks to support vector-store migration | +| `knowledge-ingest-jobs` | **productId** | Job lifecycle (queued/extracting/embedding/indexed/failed) | + +Every doc carries `productId` (house rule) + `workspaceId` + `sourceHash` for dedup. + +--- + +## 6. Phase Plan + +### Phase 1 — Document Ingestion Pipeline (week 1–2) + +- [ ] **1.1** Extend `knowledge/types.ts` with `KnowledgeDocumentDoc`, `KnowledgeChunkDoc`, `KnowledgeIngestJobDoc` +- [ ] **1.2** `POST /api/knowledge/documents` creates document row → emits `knowledge.ingest.requested` event +- [ ] **1.3** Worker (via Durable Event Bus): pulls event → fetches source (blob URL or inline bytes) → calls `extraction-service` +- [ ] **1.4** Extraction worker returns normalized Markdown + per-chunk anchors (page, line range) +- [ ] **1.5** Chunker (new `packages/rag-chunker`): + - sliding-window 800 tokens, 120 overlap (configurable) + - preserves heading hierarchy for "section" metadata + - 25+ unit tests: markdown tables, code fences, multi-page PDFs, RTL text +- [ ] **1.6** Dedup: if `sourceHash` exists for `productId`+`workspaceId`, skip re-embedding; update references only + +**Exit:** Upload `AGENTS.md` → within 60s, 10–30 chunks stored with anchors. + +--- + +### Phase 2 — Embedding + Retrieval (week 2–3) + +- [ ] **2.1** New `@bytelyst/llm-router` subpath: `llmRouter.embedBatch({ texts, model })` — default `text-embedding-3-small` +- [ ] **2.2** Embedding worker consumes `knowledge.chunks.ready` events + - Batching: up to 256 chunks / request + - Retry + DLQ via Durable Event Bus + - Cost accounting wired to `ai-budgets` module +- [ ] **2.3** Vector store — **primary** pgvector via new `packages/vector-store/` adapter + - `docker-compose.yml` adds Postgres 16 + pgvector extension + - AKV binding: `AZURE_POSTGRES_CONNECTION_STRING` + - Fallback: Cosmos native vector search (when Azure emulator supports it) as secondary adapter +- [ ] **2.4** `POST /api/knowledge/search { query, topK, filters: { workspaceId, tags, dateRange } }`: + - Embeds query (cache by sha256 of normalized query, 5-min TTL) + - pgvector k-NN with metadata filter pushdown + - Returns `{ chunkId, documentId, score, text, citation: { page, lineStart, lineEnd } }` +- [ ] **2.5** `POST /api/knowledge/assemble { query, maxTokens, productId }` — composes a context block under `maxTokens` with inline citations (`[^1]` syntax) + +**Exit:** integration test: upload a 50-page PDF → query → answer includes citations pointing to real page numbers. + +--- + +### Phase 3 — Client SDK + Access Control (week 4) + +- [ ] **3.1** New package `@bytelyst/rag-client` (mirrors existing client package patterns — see `subscription-client`, `billing-client`) + - `createRagClient({ baseUrl, token, productId })` + - Methods: `uploadDocument`, `search`, `assemble`, `getDocument`, `deleteDocument`, `listDocuments` + - TypeScript types generated from platform zod schemas +- [ ] **3.2** RBAC: every request checks `workspaceId` membership via `@bytelyst/org-client` +- [ ] **3.3** Tenant isolation test suite (`knowledge-acl.test.ts`) — 10+ cases including cross-product attack scenarios +- [ ] **3.4** Field-encrypt sensitive docs at rest (via `@bytelyst/field-encrypt`) — encrypt `text` column; chunk vectors remain plain (they're derived and low-info) + +**Exit:** NoteLett integrates `@bytelyst/rag-client` in a branch; local E2E still passes. + +--- + +### Phase 4 — Connectors + Operator UX (week 5–6) + +- [ ] **4.1** Connector: file upload via blob (reuse `@bytelyst/blob`) +- [ ] **4.2** Connector: URL crawl (single-page) using existing Python sidecar +- [ ] **4.3** Connector: Gitea repo ingest (`.md` + `AGENTS.md` across a list of repos) — feeds cowork's knowledge loop +- [ ] **4.4** Admin UI in `dashboards/admin-web/app/ops/knowledge/` — list, reindex, delete, see ingestion DLQ +- [ ] **4.5** Metrics: `knowledge_ingest_latency_seconds`, `knowledge_search_latency_ms`, `knowledge_embedding_tokens_total`, surfaced in Grafana + +**Exit:** 20 real `AGENTS.md` files across the workspace ingested; cowork query returns citations. + +--- + +## 7. Tech Stack Decision + +| Vector Backend | V1? | Rationale | +| --------------------------- | ----------------- | ----------------------------------------------------------- | +| **pgvector on Postgres 16** | ✅ Yes | Unified metadata+vector, SQL filtering, ACID for ACL, cheap | +| Qdrant | Deferred | Better pure-vector perf; adds an extra system | +| Cosmos vector search | Secondary adapter | Useful in Azure-only envs; still evolving | +| Azure AI Search | No | Vendor lock-in; less control | + +**Default:** pgvector primary, Cosmos secondary. Adapter interface allows swap without touching callers. + +--- + +## 8. Risks & Mitigations + +| Risk | Mitigation | +| ----------------------------------- | --------------------------------------------------------------------------------- | +| Retrieval leaks cross-tenant data | ACL filter enforced in adapter layer, not app layer; adversarial test suite | +| Embedding cost balloons | `ai-budgets` enforcement at dispatch + dedup by sourceHash + query cache | +| Stale embeddings after model change | `embeddingModel` recorded per chunk; reindex job triggered by model registry bump | +| Product teams bypass the service | Deprecation path: mark palace methods `@deprecated`, CI warning, 90-day sunset | + +--- + +## 9. Acceptance Checklist + +- [ ] Phase 1 — ingestion pipeline live; DLQ visible in admin +- [ ] Phase 2 — pgvector in compose; search returns citations +- [ ] Phase 3 — `@bytelyst/rag-client` published; NoteLett migration branch green +- [ ] Phase 4 — admin UI + Gitea connector + Grafana dashboard +- [ ] Cowork `InstitutionalKnowledge` opt-in migration path documented +- [ ] `docs/ECOSYSTEM_ARCHITECTURE.md` updated with the RAG substrate +- [ ] Roadmap moved to `docs/roadmaps/completed/`