bytelyst-devops-tools/docs/INTERVIEW/04-enhancement-roadmap.md
Hermes VM 076449268b docs(interview): add Senior Agentic RAG Architect prep kit
7-doc kit mapping the JD competency matrix to the ByteLyst ecosystem:
ecosystem-as-RAG-fabric architecture, competency deep-dives, STAR bank,
enhancement roadmap, banking blueprints, and a glossary quick-ref.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 10:48:52 +00:00

04 · Enhancement Roadmap — make every claim literally true

This is the "what would you build here" answer, and it doubles as a real backlog. Each enhancement turns an adjacent capability into a shipped one on infrastructure we already run. They're ordered by dependency: §A and §B form the spine, §C, §D, and §E attach to §B, and §F closes over all of them; the whole set is a credible "agentic-RAG fabric, hardened" program.

Mapping note: these slot into the existing repo conventions — new code under learning_ai_common_plat/packages + a services/rag-service, eval harness surfaced in learning_ai_devops_tools/dashboard (Hermes), and ADRs under learning_ai_devops_tools/docs/adr/. Cut tracker items via scripts/tracker-seed/.

flowchart LR
    A["§A LangGraph port<br/>+ A2A agent cards"] --> B["§B Hybrid retrieval<br/>pgvector+BM25+rerank+HyDE/CRAG/Self-RAG"]
    B --> C["§C Guarded Text-to-SQL<br/>read-only views + RLS"]
    B --> D["§D Cosmos Gremlin<br/>knowledge graph + Graph RAG"]
    B --> E["§E RAGAS/DeepEval harness<br/>+ drift monitor in Hermes"]
    C & D --> F["§F Model-card registry<br/>+ governance pack"]
    E --> F
    classDef p1 fill:#dcfce7,stroke:#16a34a
    classDef p2 fill:#fef9c3,stroke:#ca8a04
    classDef p3 fill:#fee2e2,stroke:#dc2626
    class A,B p1
    class C,D,E p2
    class F p3

| Phase | Enhancements | Why now |
|---|---|---|
| P1 (foundation) | §A, §B | Orchestration + retrieval are the spine; everything else attaches to them. |
| P2 (sources + quality) | §C, §D, §E | Add structured + graph sources and the eval loop that proves quality. |
| P3 (governance) | §F | Wrap the now-real fabric in the regulated-grade governance story. |

§A — Port agent-queue topology onto LangGraph + add A2A handoff

Goal: make the "prod-grade LangGraph" claim literal while keeping the proven state model.

  • New packages/agent-graph: a typed StateGraph with nodes route → retrieve → grade → (rewrite) → generate → critique, conditional + cyclic edges, and a checkpointer backed by event-store.
  • Keep agent-queue's engine-selection idea as node-level model binding through llm-router.
  • Expose each product agent with an A2A agent card (capabilities, auth scope, cost hints) so a supervisor agent can delegate; the card is served by mcp-server.
stateDiagram-v2
    [*] --> route
    route --> retrieve: needs evidence
    route --> generate: parametric/FAQ
    retrieve --> grade
    grade --> rewrite: low relevance (CRAG)
    rewrite --> retrieve
    grade --> generate: ok
    generate --> critique
    critique --> rewrite: ungrounded (Self-RAG)
    critique --> [*]: grounded + cited

Acceptance: a LangGraph run with a forced low-relevance retrieval demonstrably loops through rewrite; checkpoints land in event-store; one product reachable via A2A card. Effort: M. Risk: low (mapping is 1:1 with today's state machine).
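
The node topology above can be prototyped before committing to a framework. The sketch below is a dependency-free stand-in for the typed StateGraph, not LangGraph itself: `RagState`, `run_graph`, `build_demo_nodes`, and the stubbed node logic are hypothetical names chosen for illustration, and the `checkpoint` callback only marks where an event-store write would go.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

END = "__end__"


@dataclass
class RagState:
    query: str
    docs: List[str] = field(default_factory=list)
    relevance: float = 0.0
    answer: str = ""
    grounded: bool = False
    rewrites: int = 0


# A node mutates the shared state and returns the name of the next node.
Node = Callable[[RagState], str]


def run_graph(nodes: Dict[str, Node], state: RagState, start: str = "route",
              max_steps: int = 20,
              checkpoint: Optional[Callable[[str, RagState], None]] = None) -> List[str]:
    """Walk the graph until END; max_steps bounds the rewrite cycles."""
    trace: List[str] = []
    current = start
    while current != END:
        if len(trace) >= max_steps:
            raise RuntimeError("step budget exhausted; rewrite loop did not converge")
        trace.append(current)
        current = nodes[current](state)
        if checkpoint is not None:
            checkpoint(trace[-1], state)  # assumption: an event-store write goes here
    return trace


def build_demo_nodes(corpus: Dict[str, List[str]]) -> Dict[str, Node]:
    """Stub nodes; real ones would call the retriever, grader, and llm-router."""
    def route(s: RagState) -> str:
        return "retrieve"  # the parametric/FAQ shortcut to generate is omitted here

    def retrieve(s: RagState) -> str:
        s.docs = corpus.get(s.query, [])
        return "grade"

    def grade(s: RagState) -> str:
        s.relevance = 1.0 if s.docs else 0.0
        return "generate" if s.relevance >= 0.5 else "rewrite"  # CRAG-style gate

    def rewrite(s: RagState) -> str:
        s.rewrites += 1
        s.query = s.query + " (rewritten)"
        return "retrieve"

    def generate(s: RagState) -> str:
        s.answer = f"answer citing {len(s.docs)} source(s)"
        return "critique"

    def critique(s: RagState) -> str:
        s.grounded = bool(s.docs)  # Self-RAG-style groundedness check, stubbed
        return END if s.grounded else "rewrite"

    return {"route": route, "retrieve": retrieve, "grade": grade,
            "rewrite": rewrite, "generate": generate, "critique": critique}


if __name__ == "__main__":
    demo_state = RagState(query="q")
    print(run_graph(build_demo_nodes({"q (rewritten)": ["doc-1"]}), demo_state))
```

With a corpus that only matches the rewritten query, the run passes through `rewrite` once before reaching `generate`, which is the forced low-relevance loop the acceptance test asks for. A corpus that never matches exhausts the step budget instead of looping forever.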


§B — Hybrid retrieval: pgvector + BM25 + rerank + HyDE / CRAG / Self-RAG

Goal: turn "I understand hybrid RAG" into a running services/rag-service.

  • pgvector alongside the existing Postgres → one DB, one backup, transactional consistency with source rows; schema-per-tenant namespaces (mirrors productId).
  • BM25 lexical (Postgres FTS or an OpenSearch sidecar) fused with vector via RRF.
  • Cross-encoder rerank (bge-reranker or ColBERT late-interaction) on the fused candidates; context compression to fit budget.
  • HyDE query rewriter node; CRAG relevance gate; Self-RAG groundedness critic (the §A nodes).
  • Layout-aware ingestion in extraction-service: PyMuPDF / Unstructured.io, OCR fallback, table preservation, page/section provenance on every chunk.
flowchart LR
    Q --> HYDE[HyDE rewrite] --> EMB[embed]
    EMB --> VEC[(pgvector ANN)]
    Q --> BM[(BM25)]
    VEC & BM --> RRF[RRF fuse] --> RR[cross-encoder rerank] --> CC[context compress] --> GEN

Acceptance: hybrid beats vector-only on a golden set (context-recall ↑, context-precision ↑); every chunk carries doc/page/section provenance; abstain fires when reranked top-score < τ. Effort: L. Risk: medium (reranker latency budget — mitigate with rerank-top-k only).
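
The fusion and abstain steps are small enough to sketch directly. The block below shows Reciprocal Rank Fusion over two ranked lists of chunk ids, plus the abstain rule on reranked scores. The function names, the `k = 60` constant (a commonly used default), and the shape of the reranker output are assumptions for illustration; the reranker itself is not modelled.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))


def select_context(reranked: Sequence[Tuple[str, float]], tau: float,
                   top_n: int = 5) -> Optional[List[str]]:
    """Return the top_n chunk ids, or None (abstain) if the best score is below tau."""
    ranked = sorted(reranked, key=lambda pair: pair[1], reverse=True)
    if not ranked or ranked[0][1] < tau:
        return None
    return [chunk_id for chunk_id, _ in ranked[:top_n]]


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]  # hypothetical pgvector ANN result order
    bm25_hits = ["c", "a", "d"]    # hypothetical BM25 result order
    print(rrf_fuse([vector_hits, bm25_hits]))
```

RRF only needs ranks, not comparable scores, which is why it suits fusing cosine similarities with BM25 scores. Reranking only the head of the fused list is what keeps the cross-encoder inside its latency budget.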


§C — Guarded Text-to-SQL tool

Goal: add genuine generative SQL for ad-hoc analytics without the foot-guns.

  • Register a sql-query tool on mcp-server scoped to read-only semantic views (no base tables), with row-level security by tenant/role.
  • Schema-aware retrieval: embed table/column descriptions; retrieve only the relevant schema slice into the prompt (don't dump the catalog).
  • Parse + validate generated SQL (allow-list of statements, forbid cross-schema joins, enforce LIMIT); cost-cap and timeout.
  • Audit every generated query + row count to event-store.

Acceptance: an attempt to read an unentitled column is blocked at the view/RLS layer and logged; a malformed/oversized query is rejected pre-execution. Effort: M. Risk: medium (this is the highest-leakage surface — keep it behind views).
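
A pre-execution guard for generated SQL might look like the sketch below. It is deliberately conservative and regex-based, so it is a coarse filter in front of the views and RLS rather than a substitute for them; a production version would more likely use a real SQL parser. `guard_sql`, `RejectedQuery`, the keyword list, and the wrap-in-a-subquery row cap are all assumptions for illustration.

```python
import re
from typing import Set

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|grant|revoke|truncate|copy|call|into)\b",
    re.IGNORECASE,
)
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)
COMMA_JOIN = re.compile(r"\bfrom\s+[\w.]+(?:\s+(?:as\s+)?\w+)?\s*,", re.IGNORECASE)


class RejectedQuery(ValueError):
    """Raised when generated SQL fails a pre-execution check."""


def guard_sql(sql: str, allowed_views: Set[str], max_rows: int = 1000) -> str:
    """Validate one generated SELECT and return it wrapped with a row cap."""
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt or "--" in stmt or "/*" in stmt:
        raise RejectedQuery("multiple statements or comments are not allowed")
    if not re.match(r"select\b", stmt, re.IGNORECASE):
        raise RejectedQuery("only SELECT statements are allowed")
    if FORBIDDEN.search(stmt):
        raise RejectedQuery("statement contains a forbidden keyword")
    if COMMA_JOIN.search(stmt):
        raise RejectedQuery("implicit comma joins are not allowed")
    # Every FROM/JOIN target must be an allow-listed view; a schema-qualified
    # name that is not on the list is rejected, which blocks cross-schema joins.
    refs = {name.lower() for name in TABLE_REF.findall(stmt)}
    allowed = {view.lower() for view in allowed_views}
    if not refs or not refs <= allowed:
        raise RejectedQuery(f"not an allow-listed view: {sorted(refs - allowed)}")
    # Enforce the row cap from outside, whatever LIMIT the model wrote inside.
    return f"SELECT * FROM ({stmt}) AS guarded LIMIT {max_rows}"


if __name__ == "__main__":
    print(guard_sql("SELECT id FROM v_orders;", {"v_orders"}, max_rows=100))
```

The guard errs toward false rejections (for example, it refuses CTEs and any string literal containing a semicolon). That seems the right trade for the highest-leakage surface, since a rejected query can be regenerated but a leaked row cannot be recalled. The cost cap and timeout would still be set on the database session.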


§D — Cosmos Gremlin knowledge graph + Graph RAG

Goal: answer "connected-to" questions (KYC/AML-shaped) on infra we already run.

  • Use the existing Azure Cosmos DB Gremlin API. Entity/relation extraction at ingest (from extraction-service output + structured rows) builds the graph.
  • Graph-augmented retrieval: vector hit seeds an entry node → bounded Gremlin traversal returns the subgraph → fuse subgraph + text chunks into context.
  • Expose a graph-query tool on mcp-server (read-only, depth-bounded).
flowchart LR
    Q --> V[(vector seed)] --> N[entry entity]
    N --> G[(Gremlin traversal<br/>≤2 hops)]
    G --> SUB[subgraph]
    SUB --> FUSE[fuse w/ text chunks] --> GEN

Acceptance: a 2-hop relationship question that vector-only fails is answered correctly with the subgraph cited; traversal depth/time bounded. Effort: L. Risk: medium (graph modeling + traversal cost).
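
The depth and size bounds are the part worth pinning down early. The sketch below uses an in-memory edge list as a stand-in for the Gremlin traversal, so it shows the bounding logic only and makes no claim about the Cosmos query itself; `bounded_subgraph`, `to_context`, and the sample entities are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source entity, relation, target entity)


def bounded_subgraph(edges: List[Edge], seed: str,
                     max_hops: int = 2, max_edges: int = 50) -> List[Edge]:
    """Breadth-first expansion from the seed, capped by hop count and edge count."""
    adjacency: Dict[str, List[Edge]] = {}
    for edge in edges:
        adjacency.setdefault(edge[0], []).append(edge)
        adjacency.setdefault(edge[2], []).append(edge)  # traverse in both directions
    seen_nodes: Set[str] = {seed}
    seen_edges: Set[Edge] = set()
    result: List[Edge] = []
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop bound
        for edge in adjacency.get(node, []):
            if edge in seen_edges:
                continue
            if len(result) >= max_edges:
                return result  # size bound reached
            seen_edges.add(edge)
            result.append(edge)
            neighbour = edge[2] if edge[0] == node else edge[0]
            if neighbour not in seen_nodes:
                seen_nodes.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return result


def to_context(subgraph: List[Edge]) -> str:
    """Render triples as text lines so they can be fused with retrieved chunks."""
    return "\n".join(f"{s} -[{r}]-> {t}" for s, r, t in subgraph)


if __name__ == "__main__":
    sample = [("alice", "owns", "acme"), ("acme", "pays", "shellco"),
              ("shellco", "owned_by", "bob")]
    print(to_context(bounded_subgraph(sample, "alice")))
```

Rendering the subgraph as explicit triples keeps each relationship citable in the answer, which is what the acceptance criterion needs for the 2-hop question.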


§E — Evaluation harness + factual-drift monitor in Hermes

Goal: make "RAGAS / faithfulness SLAs / drift monitoring" real and visible.

  • Offline (CI): DeepEval pytest-style assertions on a golden set — faithfulness, answer-relevancy, context-precision, context-recall, answer-correctness. Regression below threshold blocks deploy.
  • Online: sample production traces, score with RAGAS / LLM-as-judge, emit metrics via telemetry-client.
  • Hermes pane: a "RAG Quality" panel (extends hermes-ops) trending the five metrics per tenant + a drift alert when faithfulness/recall degrade week-over-week.
  • Wire abstain rate and escalation rate as first-class SLAs.
flowchart TB
    subgraph CI["Offline / CI (DeepEval)"]
        G[golden set] --> SC1[score] --> GATE{≥ SLA?}
        GATE -- no --> BLOCK[block deploy]
        GATE -- yes --> SHIP[ship]
    end
    subgraph PROD["Online (RAGAS / judge)"]
        TR[sampled traces] --> SC2[score] --> TEL[telemetry-client] --> HERMES[Hermes RAG-Quality pane]
        HERMES --> DRIFT{drift?} -- yes --> ALERT[alert + open finding]
    end

Acceptance: a deliberately-degraded retriever fails the CI gate; the Hermes pane shows the five metrics per tenant and fires a drift alert on a seeded regression. Effort: M. Risk: low-medium (judge cost — sample, don't score 100%).
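
Both decision points in the diagram reduce to simple comparisons once the scores exist. The sketch below assumes the five metric scores have already been computed by DeepEval or RAGAS and shows only the gate and the week-over-week drift check; the function names, SLA values, and the 0.05 drop threshold are hypothetical.

```python
from typing import Dict, List, Sequence

METRICS = ("faithfulness", "answer_relevancy", "context_precision",
           "context_recall", "answer_correctness")


def failing_metrics(scores: Dict[str, float], slas: Dict[str, float]) -> List[str]:
    """Metrics that are missing or below SLA; a non-empty list blocks the deploy.

    slas must define a threshold for every entry in METRICS.
    """
    return [m for m in METRICS if scores.get(m, 0.0) < slas[m]]


def drift_alerts(weekly: Dict[str, Sequence[float]], max_drop: float = 0.05,
                 watched: Sequence[str] = ("faithfulness", "context_recall")) -> List[str]:
    """Watched metrics whose latest weekly value fell by more than max_drop."""
    alerts: List[str] = []
    for metric in watched:
        series = weekly.get(metric, [])
        if len(series) >= 2 and series[-2] - series[-1] > max_drop:
            alerts.append(metric)
    return alerts


if __name__ == "__main__":
    slas = {m: 0.8 for m in METRICS}           # hypothetical thresholds
    scores = {m: 0.9 for m in METRICS}
    scores["context_recall"] = 0.6             # simulate a degraded retriever
    print(failing_metrics(scores, slas))
    print(drift_alerts({"faithfulness": [0.90, 0.84], "context_recall": [0.88, 0.87]}))
```

Treating a missing metric as a failure means a broken scorer cannot silently pass the gate. In CI the gate would typically be one assertion that `failing_metrics(...)` is empty.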


§F — Model-card registry + governance pack

Goal: the regulated-grade documentation/audit layer (SR 11-7 / EU AI Act ready).

  • Model-card registry (a governance package + Hermes pane): per deployed model/agent — purpose, data sources, eval scores, known limits, owner, last-reviewed date, kill-switch link.
  • Decision log: write every generation's record (query, retrieved sources, model, faithfulness score, abstain/answer) to event-store, giving a reproducible audit trail.
  • RACI doc template per engagement; ADR set under docs/adr/ for each architectural choice.
  • Map controls to SR 11-7 (model inventory, validation, monitoring, change control) and EU AI Act (risk classification, logging, human oversight, transparency) — see 05-banking-blueprints.md.

Acceptance: every production model has a card with current eval scores + owner; any answer can be reconstructed from the decision log; controls trace to named regulatory clauses. Effort: M. Risk: low (mostly assembly over existing event-store/flags/auth).
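
The two record types named above can be sketched as plain data classes. The field names follow the bullets, but the class names, the 90-day review window, and the JSON event shape are assumptions for illustration, not an existing schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class ModelCard:
    model_id: str
    purpose: str
    data_sources: List[str]
    eval_scores: Dict[str, float]
    known_limits: List[str]
    owner: str
    last_reviewed: date
    kill_switch_url: str

    def review_overdue(self, today: date, max_age_days: int = 90) -> bool:
        """True when the card has not been reviewed within the allowed window."""
        return (today - self.last_reviewed).days > max_age_days


@dataclass(frozen=True)
class DecisionRecord:
    query: str
    source_ids: List[str]
    model_id: str
    faithfulness: float
    outcome: str  # "answer" or "abstain"

    def to_event(self) -> str:
        """Serialize with sorted keys so identical decisions yield identical payloads."""
        return json.dumps(asdict(self), sort_keys=True)


if __name__ == "__main__":
    record = DecisionRecord(query="q", source_ids=["doc-1#p3"], model_id="m1",
                            faithfulness=0.93, outcome="answer")
    print(record.to_event())
```

Storing source ids rather than source text keeps the log small, and reconstruction still works as long as chunk provenance is immutable. A registry check that flags any card where `review_overdue` is true covers the "current eval scores + owner" acceptance line.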


Sequencing & "what I'd do in the first 90 days" (great closing answer)

gantt
    title Agentic-RAG hardening — 90-day view
    dateFormat X
    axisFormat %s
    section Foundation
    §A LangGraph + A2A      :a, 0, 3s
    §B Hybrid retrieval     :b, 1, 5s
    section Sources & Quality
    §C Guarded Text-to-SQL  :c, 5, 3s
    §D Graph RAG (Gremlin)  :d, 5, 4s
    §E Eval harness + drift :e, 4, 4s
    section Governance
    §F Model cards + RACI   :f, 8, 3s

"In 90 days I'd stand up the retrieval spine and the eval harness first — because you can't tune what you can't measure — then layer structured + graph sources, and close with the governance pack so the whole thing is audit-ready. Notice governance isn't last because it's least important; it's last because by then it's mostly assembling controls the platform already enforces (auth, masking, kill-switch, audit) into cards and RACI."