7-doc kit mapping the JD competency matrix to the ByteLyst ecosystem: ecosystem-as-RAG-fabric architecture, competency deep-dives, STAR bank, enhancement roadmap, banking blueprints, and a glossary quick-ref. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
04 · Enhancement Roadmap — make every claim literally true
This is the "what would you build here" answer, and it doubles as a real backlog. Each enhancement turns an adjacent capability into a shipped one on infrastructure we already run. They're ordered so each builds on the last; the whole set is a credible "agentic-RAG fabric, hardened" program.
Mapping note: these slot into the existing repo conventions: new code under `learning_ai_common_plat/packages` + a `services/rag-service`, eval harness surfaced in `learning_ai_devops_tools/dashboard` (Hermes), and ADRs under `learning_ai_devops_tools/docs/adr/`. Cut tracker items via `scripts/tracker-seed/`.
flowchart LR
A["§A LangGraph port<br/>+ A2A agent cards"] --> B["§B Hybrid retrieval<br/>pgvector+BM25+rerank+HyDE/CRAG/Self-RAG"]
B --> C["§C Guarded Text-to-SQL<br/>read-only views + RLS"]
B --> D["§D Cosmos Gremlin<br/>knowledge graph + Graph RAG"]
B --> E["§E RAGAS/DeepEval harness<br/>+ drift monitor in Hermes"]
C & D --> F["§F Model-card registry<br/>+ governance pack"]
E --> F
classDef p1 fill:#dcfce7,stroke:#16a34a
classDef p2 fill:#fef9c3,stroke:#ca8a04
classDef p3 fill:#fee2e2,stroke:#dc2626
class A,B p1
class C,D,E p2
class F p3
| Phase | Enhancements | Why now |
|---|---|---|
| P1 (foundation) | §A, §B | Orchestration + retrieval are the spine; everything else attaches to them. |
| P2 (sources + quality) | §C, §D, §E | Add structured + graph sources and the eval loop that proves quality. |
| P3 (governance) | §F | Wrap the now-real fabric in the regulated-grade governance story. |
§A — Port agent-queue topology onto LangGraph + add A2A handoff
Goal: make the "prod-grade LangGraph" claim literal while keeping the proven state model.
- New `packages/agent-graph`: a typed `StateGraph` with nodes `route → retrieve → grade → (rewrite) → generate → critique`, conditional + cyclic edges, and a checkpointer backed by `event-store`.
- Keep `agent-queue`'s engine-selection idea as node-level model binding through `llm-router`.
- Expose each product agent with an A2A agent card (capabilities, auth scope, cost hints) so a supervisor agent can delegate; the card is served by `mcp-server`.
stateDiagram-v2
[*] --> route
route --> retrieve: needs evidence
route --> generate: parametric/FAQ
retrieve --> grade
grade --> rewrite: low relevance (CRAG)
rewrite --> retrieve
grade --> generate: ok
generate --> critique
critique --> rewrite: ungrounded (Self-RAG)
critique --> [*]: grounded + cited
Acceptance: a LangGraph run with a forced low-relevance retrieval demonstrably loops
through rewrite; checkpoints land in event-store; one product reachable via A2A card.
Effort: M. Risk: low (mapping is 1:1 with today's state machine).
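The loop in the state diagram can be sketched without any framework, which is a useful way to pin down the edge conditions before porting them to LangGraph. This is a minimal sketch under stated assumptions: it does not use the LangGraph API, the `route` node always chooses `retrieve`, the `critique` node always passes, and `RagState`, `MAX_REWRITES`, the toy word-overlap retriever, and the `trail` list (standing in for checkpoints written to `event-store`) are all hypothetical.

```python
from dataclasses import dataclass, field

MAX_REWRITES = 2  # hypothetical bound so the rewrite cycle always terminates


@dataclass
class RagState:
    query: str
    docs: list = field(default_factory=list)
    answer: str = ""
    rewrites: int = 0
    trail: list = field(default_factory=list)  # stand-in for checkpoints


def retrieve(state, corpus):
    # Toy retriever: keep documents sharing at least one word with the query.
    words = set(state.query.lower().split())
    state.docs = [d for d in corpus if words & set(d.lower().split())]


def rewrite(state, synonyms):
    # Toy query rewriter: swap each word for a known synonym, if any.
    state.query = " ".join(synonyms.get(w, w) for w in state.query.lower().split())
    state.rewrites += 1


def run(query, corpus, synonyms):
    state = RagState(query=query)
    node = "route"
    while node != "end":
        state.trail.append(node)  # one "checkpoint" per visited node
        if node == "route":
            node = "retrieve"  # assumption: every query needs evidence
        elif node == "retrieve":
            retrieve(state, corpus)
            node = "grade"
        elif node == "grade":
            if state.docs:
                node = "generate"
            elif state.rewrites < MAX_REWRITES:
                node = "rewrite"  # low relevance: take the CRAG edge
            else:
                node = "abstain"
        elif node == "rewrite":
            rewrite(state, synonyms)
            node = "retrieve"
        elif node == "generate":
            state.answer = f"{state.docs[0]} [cited]"
            node = "critique"
        elif node == "critique":
            node = "end"  # assumption: the answer quotes a doc, so it is grounded
        elif node == "abstain":
            state.answer = "ABSTAIN"
            node = "end"
    return state
```

Forcing a low-relevance first retrieval (a query with no word overlap plus a synonym map that repairs it) makes the `rewrite` hop show up in `trail`, which mirrors the acceptance check above.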
§B — Hybrid retrieval: pgvector + BM25 + rerank + HyDE / CRAG / Self-RAG
Goal: turn "I understand hybrid RAG" into a running services/rag-service.
- pgvector alongside the existing Postgres → one DB, one backup, transactional consistency with source rows; schema-per-tenant namespaces (mirrors `productId`).
- BM25 lexical (an OpenSearch sidecar, or Postgres FTS as an approximation, since its built-in `ts_rank` is not BM25) fused with vector via RRF (reciprocal rank fusion).
- Rerank the fused candidates with a cross-encoder (bge-reranker) or a late-interaction model (ColBERT); compress context to fit the token budget.
- HyDE query rewriter node; CRAG relevance gate; Self-RAG groundedness critic (the §A nodes).
- Layout-aware ingestion in `extraction-service`: PyMuPDF / Unstructured.io, OCR fallback, table preservation, page/section provenance on every chunk.
flowchart LR
Q --> HYDE[HyDE rewrite] --> EMB[embed]
EMB --> VEC[(pgvector ANN)]
Q --> BM[(BM25)]
VEC & BM --> RRF[RRF fuse] --> RR[cross-encoder rerank] --> CC[context compress] --> GEN
Acceptance: hybrid beats vector-only on a golden set (context-recall ↑, context-precision ↑); every chunk carries doc/page/section provenance; abstain fires when reranked top-score < τ. Effort: L. Risk: medium (reranker latency budget — mitigate with rerank-top-k only).
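The fusion and abstain steps are small enough to sketch directly. The following is a minimal illustration of reciprocal rank fusion, where each document scores the sum of `1 / (k + rank)` over the rankings it appears in, followed by the "abstain when the reranked top score is below τ" rule. The function names, the `k=60` default, and the tie-break on document id are assumptions for the example, not the service's actual interface.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def select_or_abstain(reranked, tau):
    """reranked: list of (doc_id, score), best first. Returns None to abstain."""
    if not reranked or reranked[0][1] < tau:
        return None
    return [doc_id for doc_id, _ in reranked]


vector_hits = ["a", "b", "c"]  # hypothetical pgvector ANN order
lexical_hits = ["b", "d", "a"]  # hypothetical BM25 order
fused = rrf_fuse([vector_hits, lexical_hits])
```

RRF only looks at ranks, so the vector and lexical scores never need to be put on a common scale, which is the usual reason to prefer it over weighted score sums.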
§C — Guarded Text-to-SQL tool
Goal: add genuine generative SQL for ad-hoc analytics without the foot-guns.
- Register a `sql-query` tool on `mcp-server` scoped to read-only semantic views (no base tables), with row-level security by tenant/role.
- Schema-aware retrieval: embed table/column descriptions; retrieve only the relevant schema slice into the prompt (don't dump the catalog).
- Parse + validate generated SQL (allow-list of statements, forbid cross-schema joins, enforce `LIMIT`); cost-cap and timeout.
- Audit every generated query + row count to `event-store`.
Acceptance: an attempt to read an unentitled column is blocked at the view/RLS layer and logged; a malformed/oversized query is rejected pre-execution. Effort: M. Risk: medium (this is the highest-leakage surface — keep it behind views).
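A pre-execution check along these lines could look like the sketch below. It is deliberately crude: it uses regular expressions rather than a real SQL parser, it misses comma-style joins, and it is only a first filter in front of the view/RLS layer, which remains the actual enforcement point. The view names, `MAX_LIMIT`, and `RejectedQuery` are hypothetical.

```python
import re

# Hypothetical allow-list of read-only semantic views (no base tables).
ALLOWED_VIEWS = {"analytics.v_account_summary", "analytics.v_txn_daily"}
MAX_LIMIT = 1000  # hypothetical row cap

FORBIDDEN = re.compile(
    r"(?i)\b(insert|update|delete|drop|alter|create|grant|truncate|copy)\b"
)


class RejectedQuery(ValueError):
    """Raised when generated SQL fails a pre-execution check."""


def validate_sql(sql):
    """Return a safe-to-run statement, or raise RejectedQuery."""
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt or "--" in stmt or "/*" in stmt:
        raise RejectedQuery("multiple statements or comments")
    if not re.match(r"(?i)^select\b", stmt):
        raise RejectedQuery("only SELECT is allowed")
    if FORBIDDEN.search(stmt):
        raise RejectedQuery("forbidden keyword")
    # Every relation named after FROM/JOIN must be an allow-listed view.
    relations = {
        r.lower() for r in re.findall(r"(?i)\b(?:from|join)\s+([a-z_][\w.]*)", stmt)
    }
    if not relations or not relations <= ALLOWED_VIEWS:
        raise RejectedQuery("relation outside the allow-listed views")
    limit = re.search(r"(?i)\blimit\s+(\d+)\s*$", stmt)
    if limit is None:
        return f"{stmt} LIMIT {MAX_LIMIT}"  # enforce a LIMIT when none is given
    if int(limit.group(1)) > MAX_LIMIT:
        raise RejectedQuery("LIMIT exceeds the row cap")
    return stmt
```

The guard errs on the side of rejecting: a string literal containing the word `delete` is refused too, which is an acceptable trade for a tool whose worst failure mode is data leakage.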
§D — Cosmos Gremlin knowledge graph + Graph RAG
Goal: answer "connected-to" questions (KYC/AML-shaped) on infra we already run.
- Use the existing Azure Cosmos DB Gremlin API. Entity/relation extraction at ingest (from `extraction-service` output + structured rows) builds the graph.
- Graph-augmented retrieval: vector hit seeds an entry node → bounded Gremlin traversal returns the subgraph → fuse subgraph + text chunks into context.
- Expose a `graph-query` tool on `mcp-server` (read-only, depth-bounded).
flowchart LR
Q --> V[(vector seed)] --> N[entry entity]
N --> G[(Gremlin traversal<br/>≤2 hops)]
G --> SUB[subgraph]
SUB --> FUSE[fuse w/ text chunks] --> GEN
Acceptance: a 2-hop relationship question that vector-only fails is answered correctly with the subgraph cited; traversal depth/time bounded. Effort: L. Risk: medium (graph modeling + traversal cost).
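The "bounded traversal" idea can be illustrated with a breadth-first walk over an in-memory adjacency map. This stands in for the Gremlin query and makes no claim about the Cosmos DB API; the function names, the `max_edges` cap, and the sample entities are hypothetical.

```python
from collections import deque


def bounded_subgraph(edges, seed, max_hops=2, max_edges=50):
    """Collect (source, relation, target) triples within max_hops of seed.

    edges maps a node id to a list of (relation, neighbor) pairs.
    """
    seen = {seed}
    frontier = deque([(seed, 0)])
    triples = []
    while frontier and len(triples) < max_edges:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # depth bound reached: do not expand this node
        for relation, neighbor in edges.get(node, []):
            if len(triples) >= max_edges:
                break  # size bound reached
            triples.append((node, relation, neighbor))
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return triples


def to_context(triples):
    # Render the subgraph as plain lines that can sit next to text chunks.
    return [f"{s} -[{r}]-> {t}" for s, r, t in triples]


graph = {
    "acct:1": [("owned_by", "person:A")],
    "person:A": [("director_of", "company:X")],
    "company:X": [("owns", "acct:9")],
}
subgraph = bounded_subgraph(graph, "acct:1")
```

With the default two hops the walk reaches `company:X` but does not expand it, so `acct:9` stays out of the context. Both bounds (depth and edge count) are what keep traversal cost predictable.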
§E — Evaluation harness + factual-drift monitor in Hermes
Goal: make "RAGAS / faithfulness SLAs / drift monitoring" real and visible.
- Offline (CI): DeepEval pytest-style assertions on a golden set: faithfulness, answer-relevancy, context-precision, context-recall, answer-correctness. Regression below threshold blocks deploy.
- Online: sample production traces, score with RAGAS / LLM-as-judge, emit metrics via `telemetry-client`.
- Hermes pane: a "RAG Quality" panel (extends `hermes-ops`) trending the five metrics per tenant + a drift alert when faithfulness/recall degrade week-over-week.
- Wire abstain rate and escalation rate as first-class SLAs.
flowchart TB
subgraph CI["Offline / CI (DeepEval)"]
G[golden set] --> SC1[score] --> GATE{≥ SLA?}
GATE -- no --> BLOCK[block deploy]
GATE -- yes --> SHIP[ship]
end
subgraph PROD["Online (RAGAS / judge)"]
TR[sampled traces] --> SC2[score] --> TEL[telemetry-client] --> HERMES[Hermes RAG-Quality pane]
HERMES --> DRIFT{drift?} -- yes --> ALERT[alert + open finding]
end
Acceptance: a deliberately-degraded retriever fails the CI gate; the Hermes pane shows the five metrics per tenant and fires a drift alert on a seeded regression. Effort: M. Risk: low-medium (judge cost — sample, don't score 100%).
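The gate and the drift check reduce to two comparisons once the scores exist. The sketch below assumes the five metrics have already been computed by the eval tool of choice and arrive as plain floats; it does not call DeepEval or RAGAS. The threshold values, `max_drop`, and function names are hypothetical.

```python
from statistics import mean

# Hypothetical SLA floors for the five metrics named above.
SLA = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_precision": 0.80,
    "context_recall": 0.80,
    "answer_correctness": 0.85,
}


def ci_gate(scores, sla=SLA):
    """Return (passed, failures); a missing metric counts as a failure."""
    failures = {
        metric: scores.get(metric, 0.0)
        for metric, floor in sla.items()
        if scores.get(metric, 0.0) < floor
    }
    return (not failures, failures)


def drifted(last_week, this_week, max_drop=0.05):
    """Flag drift when the weekly mean of a metric falls by more than max_drop."""
    return mean(last_week) - mean(this_week) > max_drop
```

In CI the first function would decide the exit code; online, the second would run per tenant and per metric over the sampled traces, so judge cost scales with the sample rate rather than with traffic.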
§F — Model-card registry + governance pack
Goal: the regulated-grade documentation/audit layer (SR 11-7 / EU AI Act ready).
- Model-card registry (a `governance` package + Hermes pane): per deployed model/agent: purpose, data sources, eval scores, known limits, owner, last-reviewed date, kill-switch link.
- Decision log: every generation's (query, retrieved sources, model, faithfulness score, abstain/answer) to `event-store` → reproducible audit trail.
- RACI doc template per engagement; ADR set under `docs/adr/` for each architectural choice.
- Map controls to SR 11-7 (model inventory, validation, monitoring, change control) and EU AI Act (risk classification, logging, human oversight, transparency); see `05-banking-blueprints.md`.
Acceptance: every production model has a card with current eval scores + owner; any
answer can be reconstructed from the decision log; controls trace to named regulatory clauses.
Effort: M. Risk: low (mostly assembly over existing event-store/flags/auth).
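As an illustration of the two record types, the sketch below models a decision-log entry with a stable content digest (so an audit can confirm a stored record was not altered) and a model card with a review-age check. The field selection follows the bullets above, but the class names, the SHA-256 digest, and the 90-day review window are assumptions, not a prescribed schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class DecisionRecord:
    query: str
    source_ids: tuple  # ids of the retrieved chunks used for the answer
    model: str
    faithfulness: float
    outcome: str  # "answer" or "abstain"

    def digest(self):
        # Canonical JSON (sorted keys) so equal records hash identically.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class ModelCard:
    name: str
    owner: str
    purpose: str
    last_reviewed: date


def stale_cards(cards, today, max_age_days=90):
    """Names of cards whose last review is older than max_age_days."""
    return [c.name for c in cards if (today - c.last_reviewed).days > max_age_days]
```

Storing the digest alongside each record in `event-store` would let a reviewer recompute it later and detect any edit to the logged query, sources, or score.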
Sequencing & "what I'd do in the first 90 days" (great closing answer)
gantt
title Agentic-RAG hardening — 90-day view
dateFormat X
axisFormat %s
section Foundation
%% with dateFormat X both numbers parse as timestamps, so each task reads: start, end
§A LangGraph + A2A :a, 0, 3
§B Hybrid retrieval :b, 1, 6
section Sources & Quality
§C Guarded Text-to-SQL :c, 5, 8
§D Graph RAG (Gremlin) :d, 5, 9
§E Eval harness + drift :e, 4, 8
section Governance
§F Model cards + RACI :f, 8, 11
"In 90 days I'd stand up the retrieval spine and the eval harness first — because you can't tune what you can't measure — then layer structured + graph sources, and close with the governance pack so the whole thing is audit-ready. Notice governance isn't last because it's least important; it's last because by then it's mostly assembling controls the platform already enforces (auth, masking, kill-switch, audit) into cards and RACI."