bytelyst-devops-tools/docs/INTERVIEW/04-enhancement-roadmap.md

# 04 · Enhancement Roadmap — make every claim literally true

This is the "what would you build here" answer, and it doubles as a real backlog. Each
enhancement turns an *adjacent* capability into a *shipped* one on infrastructure we
already run. They're ordered by dependency, so nothing starts before the piece it attaches
to; together they form a credible "agentic-RAG fabric, hardened" program.

> Mapping note: these slot into the existing repo conventions — new code under
> `learning_ai_common_plat/packages` + a `services/rag-service`, eval harness surfaced in
> `learning_ai_devops_tools/dashboard` (Hermes), and ADRs under
> `learning_ai_devops_tools/docs/adr/`. Cut tracker items via `scripts/tracker-seed/`.

```mermaid
flowchart LR
    A["§A LangGraph port<br/>+ A2A agent cards"] --> B["§B Hybrid retrieval<br/>pgvector+BM25+rerank+HyDE/CRAG/Self-RAG"]
    B --> C["§C Guarded Text-to-SQL<br/>read-only views + RLS"]
    B --> D["§D Cosmos Gremlin<br/>knowledge graph + Graph RAG"]
    B --> E["§E RAGAS/DeepEval harness<br/>+ drift monitor in Hermes"]
    C & D --> F["§F Model-card registry<br/>+ governance pack"]
    E --> F
    classDef p1 fill:#dcfce7,stroke:#16a34a
    classDef p2 fill:#fef9c3,stroke:#ca8a04
    classDef p3 fill:#fee2e2,stroke:#dc2626
    class A,B p1
    class C,D,E p2
    class F p3
```

| Phase | Enhancements | Why now |
|---|---|---|
| **P1 (foundation)** | §A, §B | Orchestration + retrieval are the spine; everything else attaches to them. |
| **P2 (sources + quality)** | §C, §D, §E | Add structured + graph sources and the eval loop that proves quality. |
| **P3 (governance)** | §F | Wrap the now-real fabric in the regulated-grade governance story. |

---

## §A — Port `agent-queue` topology onto LangGraph + add A2A handoff

**Goal:** make the "prod-grade LangGraph" claim literal while keeping the proven state model.

- New `packages/agent-graph`: a typed `StateGraph` with nodes `route → retrieve → grade → (rewrite) → generate → critique`, conditional + cyclic edges, and a checkpointer backed by `event-store`.
- Keep `agent-queue`'s engine-selection idea as **node-level model binding** through `llm-router`.
- Expose each product agent with an **A2A agent card** (capabilities, auth scope, cost hints) so a supervisor agent can delegate; the card is served by `mcp-server`.

```mermaid
stateDiagram-v2
    [*] --> route
    route --> retrieve: needs evidence
    route --> generate: parametric/FAQ
    retrieve --> grade
    grade --> rewrite: low relevance (CRAG)
    rewrite --> retrieve
    grade --> generate: ok
    generate --> critique
    critique --> rewrite: ungrounded (Self-RAG)
    critique --> [*]: grounded + cited
```
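
The state diagram above can be sketched in plain Python to show the conditional and cyclic edges before committing to LangGraph. This is a minimal stand-in, not LangGraph's API: the `RagState` fields, the `MAX_REWRITES` cap, the `RELEVANCE_FLOOR` threshold, and the stub retriever that is forced to return a low-relevance result on the first attempt are all assumptions made for illustration. The `checkpoints` list stands in for the `event-store` checkpointer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

END = "__end__"
MAX_REWRITES = 2        # hypothetical cap so the cycle always terminates
RELEVANCE_FLOOR = 0.5   # hypothetical CRAG threshold


@dataclass
class RagState:
    query: str
    needs_evidence: bool = True
    docs: List[str] = field(default_factory=list)
    relevance: float = 0.0
    answer: str = ""
    grounded: bool = False
    rewrites: int = 0
    trace: List[str] = field(default_factory=list)


def route(s: RagState) -> str:
    return "retrieve" if s.needs_evidence else "generate"


def retrieve(s: RagState) -> str:
    # Stub retriever: the first attempt is deliberately low-relevance.
    s.docs = [f"doc for: {s.query}"]
    s.relevance = 0.2 if s.rewrites == 0 else 0.9
    return "grade"


def grade(s: RagState) -> str:
    # CRAG-style gate: loop back through rewrite while relevance is too low.
    if s.relevance < RELEVANCE_FLOOR and s.rewrites < MAX_REWRITES:
        return "rewrite"
    return "generate"


def rewrite(s: RagState) -> str:
    s.rewrites += 1
    s.query = f"{s.query} (rewritten)"
    return "retrieve"


def generate(s: RagState) -> str:
    s.answer = f"answer from {len(s.docs)} source(s)"
    return "critique"


def critique(s: RagState) -> str:
    # Self-RAG-style check, reduced here to a relevance test.
    s.grounded = (not s.needs_evidence) or s.relevance >= RELEVANCE_FLOOR
    if not s.grounded and s.rewrites < MAX_REWRITES:
        return "rewrite"
    return END


NODES: Dict[str, Callable[[RagState], str]] = {
    "route": route, "retrieve": retrieve, "grade": grade,
    "rewrite": rewrite, "generate": generate, "critique": critique,
}


def run(state: RagState, checkpoints: List[Tuple[str, int]]) -> RagState:
    node = "route"
    while node != END:
        state.trace.append(node)
        next_node = NODES[node](state)
        checkpoints.append((node, state.rewrites))  # one checkpoint per step
        node = next_node
    return state


checkpoints: List[Tuple[str, int]] = []
final = run(RagState(query="what is our refund policy?"), checkpoints)
```

Each node returns the name of the next node, which is the same idea as a conditional edge. The rewrite cap is what keeps the two cycles (`grade → rewrite` and `critique → rewrite`) from looping forever; a run that exhausts it ends with `grounded` still false, which the caller can treat as an abstain.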

**Acceptance:** a LangGraph run with a forced low-relevance retrieval demonstrably loops
through `rewrite`; checkpoints land in `event-store`; one product reachable via A2A card.
**Effort:** M. **Risk:** low (mapping is 1:1 with today's state machine).
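
To make the delegation idea concrete, here is a rough sketch of how a supervisor might choose among agent cards. The `AgentCard` fields below simply mirror the three items named above (capabilities, auth scope, cost hints); they are hypothetical and are not the A2A specification's card schema. The agent names, scopes, and costs are invented for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Set


@dataclass(frozen=True)
class AgentCard:
    name: str
    capabilities: FrozenSet[str]
    required_scope: str          # hypothetical auth-scope field
    cost_per_call_usd: float     # hypothetical cost hint


def pick_agent(
    cards: Iterable[AgentCard], capability: str, caller_scopes: Set[str]
) -> Optional[AgentCard]:
    """Return the cheapest agent that offers the capability and that the
    caller is authorised to invoke, or None if there is no such agent."""
    eligible = [
        c for c in cards
        if capability in c.capabilities and c.required_scope in caller_scopes
    ]
    return min(eligible, key=lambda c: c.cost_per_call_usd, default=None)


cards = [
    AgentCard("notes-agent", frozenset({"summarize", "search"}), "notes:read", 0.002),
    AgentCard("finance-agent", frozenset({"summarize", "sql"}), "finance:read", 0.010),
]
chosen = pick_agent(cards, "summarize", {"notes:read", "finance:read"})
```

Filtering on scope before cost means an unauthorised caller never sees a cheaper agent it cannot use.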

---

## §B — Hybrid retrieval: pgvector + BM25 + rerank + HyDE / CRAG / Self-RAG

**Goal:** turn "I understand hybrid RAG" into a running `services/rag-service`.

- **pgvector** alongside the existing Postgres → one DB, one backup, transactional consistency with source rows; **schema-per-tenant** namespaces (mirrors `productId`).
- **BM25** lexical (Postgres FTS or an OpenSearch sidecar) fused with vector via **RRF**.
- **Cross-encoder rerank** (e.g. bge-reranker, or ColBERT-style late interaction as an alternative) on the fused candidates; **context compression** to fit the token budget.
- **HyDE** query-expansion node (embed a hypothetical answer instead of the raw query); **CRAG** relevance gate; **Self-RAG** groundedness critic (the §A nodes).
- **Layout-aware ingestion** in `extraction-service`: PyMuPDF / Unstructured.io, OCR fallback, table preservation, **page/section provenance** on every chunk.

```mermaid
flowchart LR
    Q --> HYDE[HyDE rewrite] --> EMB[embed]
    EMB --> VEC[(pgvector ANN)]
    Q --> BM[(BM25)]
    VEC & BM --> RRF[RRF fuse] --> RR[cross-encoder rerank] --> CC[context compress] --> GEN
```

**Acceptance:** hybrid beats vector-only on a golden set (context-recall ↑, context-precision ↑);
every chunk carries doc/page/section provenance; abstain fires when reranked top-score < τ.
**Effort:** L. **Risk:** medium (reranker latency budget; mitigate by reranking only the top-k fused candidates).
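
The fusion and abstain steps are small enough to sketch directly. Reciprocal Rank Fusion scores each document as the sum of `1 / (k + rank)` over every ranking it appears in; `k = 60` is the commonly used default. The document IDs, the tie-break rule, and the `should_abstain` helper name are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion.

    Ranks are 1-based. Ties are broken by doc ID so the output is stable.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


def should_abstain(reranked_scores: Sequence[float], tau: float) -> bool:
    """Abstain when nothing was retrieved or the best reranked score is below tau."""
    return not reranked_scores or max(reranked_scores) < tau


vector_hits = ["a", "b", "c"]   # hypothetical pgvector ANN result, best first
bm25_hits = ["b", "d", "a"]     # hypothetical BM25 result, best first
fused = rrf_fuse([vector_hits, bm25_hits])
```

RRF only needs ranks, not raw scores, which is why it can combine cosine similarities and BM25 scores without normalising either.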

---

## §C — Guarded Text-to-SQL tool

**Goal:** add genuine generative SQL for ad-hoc analytics without the foot-guns.

- Register a `sql-query` tool on `mcp-server` scoped to **read-only semantic views** (no base tables), with **row-level security** by tenant/role.
- **Schema-aware retrieval:** embed table/column descriptions; retrieve only the relevant schema slice into the prompt (don't dump the catalog).
- Parse + validate generated SQL (allow-list of statements, forbid cross-schema joins, enforce `LIMIT`); cost-cap and timeout.
- Audit every generated query + row count to `event-store`.

**Acceptance:** an attempt to read an unentitled column is blocked at the view/RLS layer
and logged; a malformed/oversized query is rejected pre-execution.
**Effort:** M. **Risk:** medium (this is the highest-leakage surface — keep it behind views).
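
A pre-execution check along these lines could look like the sketch below. It is deliberately conservative and regex-based, so it will reject some legitimate queries (for example `EXTRACT(... FROM ...)`) and does not inspect comma-separated `FROM` lists; a production version would more likely use a real SQL parser, with the view/RLS layer remaining the actual enforcement point. The view names and `MAX_LIMIT` are hypothetical.

```python
import re

# Hypothetical allow-list of read-only semantic views in a single schema.
ALLOWED_VIEWS = {"analytics.v_orders", "analytics.v_customers"}
MAX_LIMIT = 1000  # hypothetical row cap

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|grant|truncate|copy)\b", re.I
)
RELATION = re.compile(r"\b(?:from|join)\s+([a-z_][\w.]*)", re.I)
TRAILING_LIMIT = re.compile(r"\blimit\s+(\d+)\s*$", re.I)


class SqlRejected(ValueError):
    """Raised when generated SQL fails the pre-execution checks."""


def validate_sql(sql: str) -> str:
    """Return a safe-to-run statement (LIMIT enforced) or raise SqlRejected."""
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt:
        raise SqlRejected("multiple statements are not allowed")
    if not re.match(r"select\b", stmt, re.I):
        raise SqlRejected("only SELECT statements are allowed")
    if FORBIDDEN.search(stmt):
        raise SqlRejected("forbidden keyword")

    # Every relation named after FROM/JOIN must be an allow-listed view,
    # which also rules out cross-schema joins and base tables.
    relations = RELATION.findall(stmt)
    if not relations:
        raise SqlRejected("no relation found")
    for rel in relations:
        if rel.lower() not in ALLOWED_VIEWS:
            raise SqlRejected(f"relation not allowed: {rel}")

    # Enforce a LIMIT: append one if missing, reject oversized ones.
    match = TRAILING_LIMIT.search(stmt)
    if match is None:
        return f"{stmt} LIMIT {MAX_LIMIT}"
    if int(match.group(1)) > MAX_LIMIT:
        raise SqlRejected("LIMIT exceeds cap")
    return stmt


safe = validate_sql("SELECT id, total FROM analytics.v_orders WHERE total > 100")
```

Rejecting on doubt is the intended bias here: a false rejection costs the user a retry, while a false acceptance on this surface is a potential leak.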

---

## §D — Cosmos Gremlin knowledge graph + Graph RAG

**Goal:** answer "connected-to" questions (KYC/AML-shaped) on infra we already run.

- Use the existing **Azure Cosmos DB Gremlin** API. Entity/relation extraction at ingest (from `extraction-service` output + structured rows) builds the graph.
- **Graph-augmented retrieval:** vector hit seeds an entry node → bounded Gremlin traversal returns the subgraph → fuse subgraph + text chunks into context.
- Expose a `graph-query` tool on `mcp-server` (read-only, depth-bounded).

```mermaid
flowchart LR
    Q --> V[(vector seed)] --> N[entry entity]
    N --> G[(Gremlin traversal<br/>≤2 hops)]
    G --> SUB[subgraph]
    SUB --> FUSE[fuse w/ text chunks] --> GEN
```

**Acceptance:** a 2-hop relationship question that vector-only fails is answered correctly
with the subgraph cited; traversal depth/time bounded.
**Effort:** L. **Risk:** medium (graph modeling + traversal cost).
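
The depth-bounded traversal can be illustrated with an in-memory stand-in for the graph. This sketch does not call Cosmos DB or Gremlin; it only shows the "seed entity, expand at most two hops, return the subgraph" logic as a breadth-first search. The edge data, entity names, and the `subgraph_to_context` formatting are invented for the example.

```python
from collections import deque
from typing import Deque, Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source, relation, target)


def bounded_subgraph(edges: List[Edge], seed: str, max_hops: int = 2) -> List[Edge]:
    """Collect every edge reachable within max_hops of seed (edges treated
    as undirected), in breadth-first discovery order."""
    adjacency: Dict[str, List[Edge]] = {}
    for edge in edges:
        adjacency.setdefault(edge[0], []).append(edge)
        adjacency.setdefault(edge[2], []).append(edge)

    seen: Set[str] = {seed}
    frontier: Deque[Tuple[str, int]] = deque([(seed, 0)])
    subgraph: List[Edge] = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # depth bound: do not expand past the last allowed hop
        for edge in adjacency.get(node, []):
            if edge not in subgraph:
                subgraph.append(edge)
            neighbour = edge[2] if edge[0] == node else edge[0]
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return subgraph


def subgraph_to_context(subgraph: List[Edge]) -> str:
    """Render edges as citeable lines to fuse with the text chunks."""
    return "\n".join(f"{s} -[{r}]-> {t}" for s, r, t in subgraph)


edges = [
    ("alice", "owns", "acme"),
    ("acme", "pays", "shellco"),
    ("shellco", "registered_in", "panama"),
    ("bob", "owns", "other_co"),
]
context = subgraph_to_context(bounded_subgraph(edges, seed="alice"))
```

With `max_hops=2` the traversal reaches `shellco` from `alice` but stops before `panama`, which is the kind of bound that keeps traversal cost predictable.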

---

## §E — Evaluation harness + factual-drift monitor in Hermes

**Goal:** make "RAGAS / faithfulness SLAs / drift monitoring" real and visible.

- **Offline (CI):** **DeepEval** pytest-style assertions on a golden set — faithfulness, answer-relevancy, context-precision, context-recall, answer-correctness. Regression below threshold **blocks deploy**.
- **Online:** sample production traces, score with **RAGAS / LLM-as-judge**, emit metrics via `telemetry-client`.
- **Hermes pane:** a "RAG Quality" panel (extends `hermes-ops`) trending the five metrics per tenant + a **drift alert** when faithfulness/recall degrade week-over-week.
- Wire **abstain rate** and **escalation rate** as first-class SLAs.

```mermaid
flowchart TB
    subgraph CI["Offline / CI (DeepEval)"]
        G[golden set] --> SC1[score] --> GATE{≥ SLA?}
        GATE -- no --> BLOCK[block deploy]
        GATE -- yes --> SHIP[ship]
    end
    subgraph PROD["Online (RAGAS / judge)"]
        TR[sampled traces] --> SC2[score] --> TEL[telemetry-client] --> HERMES[Hermes RAG-Quality pane]
        HERMES --> DRIFT{drift?} -- yes --> ALERT[alert + open finding]
    end
```

**Acceptance:** a deliberately-degraded retriever fails the CI gate; the Hermes pane shows
the five metrics per tenant and fires a drift alert on a seeded regression.
**Effort:** M. **Risk:** low-medium (judge cost — sample, don't score 100%).
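
Independent of which scoring library produces the numbers, the gate and the drift check reduce to two small comparisons. The sketch below assumes the five metric scores are already computed as floats in `[0, 1]`; the SLA floors, the `max_drop` tolerance, and the function names are hypothetical placeholders, not values from this roadmap.

```python
from typing import Dict, List, Sequence

# Hypothetical SLA floors; real values would come from a golden-set baseline.
SLA_FLOORS: Dict[str, float] = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.75,
    "answer_correctness": 0.80,
}


def gate_failures(
    scores: Dict[str, float], floors: Dict[str, float] = SLA_FLOORS
) -> List[str]:
    """Return the metrics that fall below their floor.

    A metric missing from `scores` counts as a failure, so a broken
    scorer cannot silently pass the gate.
    """
    return sorted(m for m, floor in floors.items() if scores.get(m, 0.0) < floor)


def drifted_metrics(
    last_week: Dict[str, float],
    this_week: Dict[str, float],
    watched: Sequence[str] = ("faithfulness", "context_recall"),
    max_drop: float = 0.05,
) -> List[str]:
    """Return watched metrics whose week-over-week drop exceeds max_drop."""
    return [m for m in watched if last_week[m] - this_week[m] > max_drop]


golden_run = {
    "faithfulness": 0.91, "answer_relevancy": 0.88, "context_precision": 0.82,
    "context_recall": 0.40, "answer_correctness": 0.86,
}
failures = gate_failures(golden_run)
deploy_allowed = not failures
```

In CI, a non-empty `failures` list would fail the job; in the online path, a non-empty `drifted_metrics` result is what would raise the alert and open a finding.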

---

## §F — Model-card registry + governance pack

**Goal:** the regulated-grade documentation/audit layer (SR 11-7 / EU AI Act ready).

- **Model-card registry** (a `governance` package + Hermes pane): per deployed model/agent — purpose, data sources, eval scores, known limits, owner, last-reviewed date, kill-switch link.
- **Decision log:** write every generation's (query, retrieved sources, model, faithfulness score, abstain/answer) tuple to `event-store` → reproducible audit trail.
- **RACI doc** template per engagement; **ADR** set under `docs/adr/` for each architectural choice.
- Map controls to **SR 11-7** (model inventory, validation, monitoring, change control) and **EU AI Act** (risk classification, logging, human oversight, transparency) — see `05-banking-blueprints.md`.

**Acceptance:** every production model has a card with current eval scores + owner; any
answer can be reconstructed from the decision log; controls trace to named regulatory clauses.
**Effort:** M. **Risk:** low (mostly assembly over existing `event-store`/flags/auth).
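
One possible shape for a decision-log entry and a model-card freshness check is sketched below. The field names follow the lists above, but the record layout, the SHA-256 content hash, and the 90-day review window are assumptions for illustration, not an existing `event-store` schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import date
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class DecisionRecord:
    query: str
    source_ids: Tuple[str, ...]   # provenance of the retrieved chunks
    model: str
    faithfulness: float
    outcome: str                  # "answer" or "abstain"

    def to_event(self) -> Dict[str, object]:
        """Serialise deterministically and attach a content hash so that
        later tampering with a stored record is detectable."""
        payload = asdict(self)
        body = json.dumps(payload, sort_keys=True)
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        return {"payload": payload, "sha256": digest}


@dataclass(frozen=True)
class ModelCard:
    model: str
    owner: str
    last_reviewed: date


def stale_cards(
    cards: Iterable[ModelCard], today: date, max_age_days: int = 90
) -> List[str]:
    """Return the models whose card review is older than max_age_days."""
    return [c.model for c in cards if (today - c.last_reviewed).days > max_age_days]


record = DecisionRecord(
    query="What is the wire-transfer limit?",
    source_ids=("policy.pdf#p4", "faq.md#limits"),
    model="example-model-v1",     # hypothetical model name
    faithfulness=0.93,
    outcome="answer",
)
event = record.to_event()
```

Sorting keys before hashing makes the digest depend only on the content, so two identical decisions always produce the same hash regardless of field order.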

---

## Sequencing & "what I'd do in the first 90 days" (great closing answer)

```mermaid
gantt
    title Agentic-RAG hardening — 90-day view
    dateFormat X
    axisFormat %s
    section Foundation
    §A LangGraph + A2A      :a, 0, 3
    §B Hybrid retrieval     :b, 1, 6
    section Sources & Quality
    §C Guarded Text-to-SQL  :c, 5, 8
    §D Graph RAG (Gremlin)  :d, 5, 9
    §E Eval harness + drift :e, 4, 8
    section Governance
    §F Model cards + RACI   :f, 8, 11
```

> *"In 90 days I'd stand up the retrieval spine and the eval harness first — because you
> can't tune what you can't measure — then layer structured + graph sources, and close with
> the governance pack so the whole thing is audit-ready. Notice governance isn't last
> because it's least important; it's last because by then it's mostly **assembling controls
> the platform already enforces** (auth, masking, kill-switch, audit) into cards and RACI."*