# 04 · Enhancement Roadmap — make every claim literally true

This is the "what would you build here" answer, and it doubles as a real backlog. Each
enhancement turns an *adjacent* capability into a *shipped* one on infrastructure we already
run. They're ordered so each builds on the last; the whole set is a credible "agentic-RAG
fabric, hardened" program.

> Mapping note: these slot into the existing repo conventions — new code under
> `learning_ai_common_plat/packages` + a `services/rag-service`, eval harness surfaced in
> `learning_ai_devops_tools/dashboard` (Hermes), and ADRs under
> `learning_ai_devops_tools/docs/adr/`. Cut tracker items via `scripts/tracker-seed/`.

```mermaid
flowchart LR
  A["§A LangGraph port<br/>+ A2A agent cards"] --> B["§B Hybrid retrieval<br/>pgvector+BM25+rerank+HyDE/CRAG/Self-RAG"]
  B --> C["§C Guarded Text-to-SQL<br/>read-only views + RLS"]
  B --> D["§D Cosmos Gremlin<br/>knowledge graph + Graph RAG"]
  B --> E["§E RAGAS/DeepEval harness<br/>+ drift monitor in Hermes"]
  C & D --> F["§F Model-card registry<br/>+ governance pack"]
  E --> F
  classDef p1 fill:#dcfce7,stroke:#16a34a
  classDef p2 fill:#fef9c3,stroke:#ca8a04
  classDef p3 fill:#fee2e2,stroke:#dc2626
  class A,B p1
  class C,D,E p2
  class F p3
```

| Phase | Enhancements | Why now |
|---|---|---|
| **P1 (foundation)** | §A, §B | Orchestration + retrieval are the spine; everything else attaches to them. |
| **P2 (sources + quality)** | §C, §D, §E | Add structured + graph sources and the eval loop that proves quality. |
| **P3 (governance)** | §F | Wrap the now-real fabric in the regulated-grade governance story. |

---

## §A — Port `agent-queue` topology onto LangGraph + add A2A handoff

**Goal:** make the "prod-grade LangGraph" claim literal while keeping the proven state model.

- New `packages/agent-graph`: a typed `StateGraph` with nodes `route → retrieve → grade → (rewrite) → generate → critique`, conditional + cyclic edges, and a checkpointer backed by `event-store`.
- Keep `agent-queue`'s engine-selection idea as **node-level model binding** through `llm-router`.
- Expose each product agent with an **A2A agent card** (capabilities, auth scope, cost hints) so a supervisor agent can delegate; the card is served by `mcp-server`.

```mermaid
stateDiagram-v2
  [*] --> route
  route --> retrieve: needs evidence
  route --> generate: parametric/FAQ
  retrieve --> grade
  grade --> rewrite: low relevance (CRAG)
  rewrite --> retrieve
  grade --> generate: ok
  generate --> critique
  critique --> rewrite: ungrounded (Self-RAG)
  critique --> [*]: grounded + cited
```

**Acceptance:** a LangGraph run with a forced low-relevance retrieval demonstrably loops through `rewrite`; checkpoints land in `event-store`; one product reachable via A2A card.

**Effort:** M. **Risk:** low (mapping is 1:1 with today's state machine).

---

## §B — Hybrid retrieval: pgvector + BM25 + rerank + HyDE / CRAG / Self-RAG

**Goal:** turn "I understand hybrid RAG" into a running `services/rag-service`.
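
One piece of that service is small enough to sketch up front: Reciprocal Rank Fusion (RRF), which merges the lexical and vector rankings into one candidate list before reranking. The snippet below is a minimal sketch, not the service's implementation. The function name `rrf_fuse`, the chunk ids, and the constant `k = 60` (a commonly used default) are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every id it contains (rank is
    1-based). Ids missing from a list simply get no contribution from it.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists: one from the vector index, one from BM25.
vector_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]

fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)
```

RRF uses only ranks, so the vector distances and BM25 scores never need to be normalised onto a common scale. That is the main reason to prefer it over a weighted score sum as a first cut.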

- **pgvector** alongside the existing Postgres → one DB, one backup, transactional consistency with source rows; **schema-per-tenant** namespaces (mirrors `productId`).
- **BM25** lexical (Postgres FTS or an OpenSearch sidecar) fused with vector via **RRF**.
- **Cross-encoder rerank** (bge-reranker or ColBERT late-interaction) on the fused candidates; **context compression** to fit budget.
- **HyDE** query rewriter node; **CRAG** relevance gate; **Self-RAG** groundedness critic (the §A nodes).
- **Layout-aware ingestion** in `extraction-service`: PyMuPDF / Unstructured.io, OCR fallback, table preservation, **page/section provenance** on every chunk.

```mermaid
flowchart LR
  Q --> HYDE[HyDE rewrite] --> EMB[embed]
  EMB --> VEC[(pgvector ANN)]
  Q --> BM[(BM25)]
  VEC & BM --> RRF[RRF fuse] --> RR[cross-encoder rerank] --> CC[context compress] --> GEN
```

**Acceptance:** hybrid beats vector-only on a golden set (context-recall ↑, context-precision ↑); every chunk carries doc/page/section provenance; abstain fires when reranked top-score < τ.

**Effort:** L. **Risk:** medium (reranker latency budget — mitigate with rerank-top-k only).

---

## §C — Guarded Text-to-SQL tool

**Goal:** add genuine generative SQL for ad-hoc analytics without the foot-guns.

- Register a `sql-query` tool on `mcp-server` scoped to **read-only semantic views** (no base tables), with **row-level security** by tenant/role.
- **Schema-aware retrieval:** embed table/column descriptions; retrieve only the relevant schema slice into the prompt (don't dump the catalog).
- Parse + validate generated SQL (allow-list of statements, forbid cross-schema joins, enforce `LIMIT`); cost-cap and timeout.
- Audit every generated query + row count to `event-store`.

**Acceptance:** an attempt to read an unentitled column is blocked at the view/RLS layer and logged; a malformed/oversized query is rejected pre-execution.

**Effort:** M. **Risk:** medium (this is the highest-leakage surface — keep it behind views).
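
The pre-execution check in §C (statement allow-list, view allow-list, enforced `LIMIT`) might look roughly like the sketch below. It is deliberately conservative and regex-based, so treat it as an illustration of the gate's shape rather than a safe validator: a real version would use a proper SQL parser, and the views + RLS stay the actual enforcement boundary. The view names, `MAX_LIMIT`, `guard_sql`, `is_allowed`, and `RejectedQuery` are all hypothetical.

```python
import re

# Hypothetical allow-list of read-only semantic views and a hypothetical row cap.
ALLOWED_VIEWS = {"analytics.v_orders", "analytics.v_customers"}
MAX_LIMIT = 1000

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|grant|truncate|copy)\b", re.I
)
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.I)
LIMIT_RE = re.compile(r"\blimit\s+(\d+)\s*$", re.I)


class RejectedQuery(ValueError):
    """Raised when generated SQL fails the pre-execution checks."""


def guard_sql(sql: str) -> str:
    """Return a safe-to-run statement or raise RejectedQuery.

    Known gaps of this sketch: comma joins and quoted identifiers are not
    inspected, which is why a real parser is needed in practice.
    """
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt:
        raise RejectedQuery("multiple statements")
    if "--" in stmt or "/*" in stmt:
        raise RejectedQuery("comments are not allowed")
    if not re.match(r"select\b", stmt, re.I):
        raise RejectedQuery("only SELECT is allowed")
    if FORBIDDEN.search(stmt):
        raise RejectedQuery("forbidden keyword")
    tables = {name.lower() for name in TABLE_REF.findall(stmt)}
    if not tables or not tables <= ALLOWED_VIEWS:
        raise RejectedQuery("query touches something outside the allowed views")
    match = LIMIT_RE.search(stmt)
    if match is None:
        return f"{stmt} LIMIT {MAX_LIMIT}"  # enforce a cap when none was given
    if int(match.group(1)) > MAX_LIMIT:
        raise RejectedQuery("LIMIT exceeds the cap")
    return stmt


def is_allowed(sql: str) -> bool:
    """Convenience wrapper: True when guard_sql accepts the statement."""
    try:
        guard_sql(sql)
    except RejectedQuery:
        return False
    return True
```

The gate fails closed: anything it cannot positively classify as a single bounded `SELECT` over allowed views is rejected. It will also reject some harmless queries, for example a string literal that contains the word `delete`, which is the intended trade-off on the highest-leakage surface.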

---

## §D — Cosmos Gremlin knowledge graph + Graph RAG

**Goal:** answer "connected-to" questions (KYC/AML-shaped) on infra we already run.

- Use the existing **Azure Cosmos DB Gremlin** API. Entity/relation extraction at ingest (from `extraction-service` output + structured rows) builds the graph.
- **Graph-augmented retrieval:** vector hit seeds an entry node → bounded Gremlin traversal returns the subgraph → fuse subgraph + text chunks into context.
- Expose a `graph-query` tool on `mcp-server` (read-only, depth-bounded).

```mermaid
flowchart LR
  Q --> V[(vector seed)] --> N[entry entity]
  N --> G[(Gremlin traversal<br/>≤2 hops)]
  G --> SUB[subgraph]
  SUB --> FUSE[fuse w/ text chunks] --> GEN
```

**Acceptance:** a 2-hop relationship question that vector-only fails is answered correctly with the subgraph cited; traversal depth/time bounded.

**Effort:** L. **Risk:** medium (graph modeling + traversal cost).

---

## §E — Evaluation harness + factual-drift monitor in Hermes

**Goal:** make "RAGAS / faithfulness SLAs / drift monitoring" real and visible.

- **Offline (CI):** **DeepEval** pytest-style assertions on a golden set — faithfulness, answer-relevancy, context-precision, context-recall, answer-correctness. Regression below threshold **blocks deploy**.
- **Online:** sample production traces, score with **RAGAS / LLM-as-judge**, emit metrics via `telemetry-client`.
- **Hermes pane:** a "RAG Quality" panel (extends `hermes-ops`) trending the five metrics per tenant + a **drift alert** when faithfulness/recall degrade week-over-week.
- Wire **abstain rate** and **escalation rate** as first-class SLAs.

```mermaid
flowchart TB
  subgraph CI["Offline / CI (DeepEval)"]
    G[golden set] --> SC1[score] --> GATE{≥ SLA?}
    GATE -- no --> BLOCK[block deploy]
    GATE -- yes --> SHIP[ship]
  end
  subgraph PROD["Online (RAGAS / judge)"]
    TR[sampled traces] --> SC2[score] --> TEL[telemetry-client] --> HERMES[Hermes RAG-Quality pane]
    HERMES --> DRIFT{drift?} -- yes --> ALERT[alert + open finding]
  end
```

**Acceptance:** a deliberately-degraded retriever fails the CI gate; the Hermes pane shows the five metrics per tenant and fires a drift alert on a seeded regression.

**Effort:** M. **Risk:** low-medium (judge cost — sample, don't score 100%).

---

## §F — Model-card registry + governance pack

**Goal:** the regulated-grade documentation/audit layer (SR 11-7 / EU AI Act ready).

- **Model-card registry** (a `governance` package + Hermes pane): per deployed model/agent — purpose, data sources, eval scores, known limits, owner, last-reviewed date, kill-switch link.
- **Decision log:** every generation's (query, retrieved sources, model, faithfulness score, abstain/answer) to `event-store` → reproducible audit trail.
- **RACI doc** template per engagement; **ADR** set under `docs/adr/` for each architectural choice.
- Map controls to **SR 11-7** (model inventory, validation, monitoring, change control) and **EU AI Act** (risk classification, logging, human oversight, transparency) — see `05-banking-blueprints.md`.

**Acceptance:** every production model has a card with current eval scores + owner; any answer can be reconstructed from the decision log; controls trace to named regulatory clauses.

**Effort:** M. **Risk:** low (mostly assembly over existing `event-store`/flags/auth).

---

## Sequencing & "what I'd do in the first 90 days" (great closing answer)

```mermaid
gantt
  title Agentic-RAG hardening — 90-day view
  dateFormat X
  axisFormat %s
  section Foundation
  §A LangGraph + A2A :a, 0, 3
  §B Hybrid retrieval :b, 1, 5
  section Sources & Quality
  §C Guarded Text-to-SQL :c, 5, 3
  §D Graph RAG (Gremlin) :d, 5, 4
  §E Eval harness + drift :e, 4, 4
  section Governance
  §F Model cards + RACI :f, 8, 3
```

> *"In 90 days I'd stand up the retrieval spine and the eval harness first — because you
> can't tune what you can't measure — then layer structured + graph sources, and close with
> the governance pack so the whole thing is audit-ready. Notice governance isn't last
> because it's least important; it's last because by then it's mostly **assembling controls
> the platform already enforces** (auth, masking, kill-switch, audit) into cards and RACI."*