# Extraction Service — Roadmap & Task Checklist
> **Service:** `@lysnrai/extraction-service` (port 4005)
> **Package:** `@bytelyst/extraction` (shared types + client)
> **Core dependency:** [google/langextract](https://github.com/google/langextract) (Python)
>
> **Companion docs:** [ECOSYSTEM_ARCHITECTURE.md](./ECOSYSTEM_ARCHITECTURE.md) · [ROADMAP.md](./ROADMAP.md)
---
## Overview
A shared extraction microservice that uses Google's LangExtract library to extract structured information from unstructured text. Both LysnrAI and MindLyst consume this service for their respective extraction needs.

**Architecture:** Fastify (routing, auth, validation, request tracing) + Python sidecar (LangExtract). The Fastify layer keeps the service consistent with the four existing services (growth, billing, platform, tracker). The Python process handles the LLM-powered extraction.
```
┌──────────────────────────────────────────────────────────┐
│                    extraction-service                    │
│                       (port 4005)                        │
│                                                          │
│  ┌─────────────────────┐   ┌──────────────────────────┐  │
│  │  Fastify (TS)       │   │  Python Sidecar          │  │
│  │                     │   │                          │  │
│  │  - Auth middleware  │──►│  - LangExtract wrapper   │  │
│  │  - Zod validation   │◄──│  - Task registry         │  │
│  │  - x-request-id     │   │  - Model provider config │  │
│  │  - Rate limiting    │   │  - Result caching        │  │
│  │  - /health          │   │                          │  │
│  └─────────────────────┘   └──────────────────────────┘  │
└──────────────────────────────────────────────────────────┘
              ▲                           ▲
              │                           │
          REST API            FastAPI (internal :4006)
         (external)              or subprocess stdio
```
### Consumers
| Product | Use Case | Entry Point |
| ----------------------------- | ------------------------------------------------------------------------ | -------------------------------------------------- |
| **LysnrAI** — Desktop/Backend | Post-transcription extraction (action items, decisions, dates, people) | `backend/src/clients/extraction_client.py` |
| **LysnrAI** — Admin Dashboard | Transcript analytics, entity review | `admin-dashboard-web/src/lib/extraction-client.ts` |
| **MindLyst** — KMP/Web | Triage pipeline (brain routing, entity extraction, topic classification) | `mindlyst-native/web/src/pages/api/triage.ts` |
| **MindLyst** — Web Dashboard | Brain insight generation, reflection enrichment | Direct API calls via `@bytelyst/api-client` |
---
## Phase 0 — Foundation & Scaffolding
> **Goal:** Set up the service skeleton, Python environment, and build pipeline.
### Service scaffold (Fastify)
- [x] **0.1** Create `services/extraction-service/` directory structure: [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
```
services/extraction-service/
  src/
    lib/
      config.ts           # Zod config schema (PORT, HOST, CORS, PYTHON_SIDECAR_URL, etc.)
      errors.ts           # Re-export from @bytelyst/errors
      cosmos.ts           # Re-export from @bytelyst/cosmos (for task registry persistence)
      product-config.ts   # Re-export from @bytelyst/config
      python-bridge.ts    # HTTP client to Python sidecar
    modules/
      extract/
        types.ts          # Zod schemas: ExtractionTask, ExtractionExample, ExtractionResult
        routes.ts         # POST /api/extract, POST /api/extract/batch, GET /api/tasks
      tasks/
        types.ts          # Predefined task definitions (triage, transcript, etc.)
        repository.ts     # Cosmos CRUD for custom task definitions
        routes.ts         # CRUD endpoints for task management
    server.ts             # createServiceApp + route registration
  package.json
  tsconfig.json
  Dockerfile
```
- [x] **0.2** Create `package.json` (`@lysnrai/extraction-service`, port 4005) matching existing service conventions [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.3** Create `tsconfig.json` (self-contained, matching tracker-service pattern) [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.4** Create `src/lib/config.ts` with Zod schema (`PORT`, `HOST`, `NODE_ENV`, `CORS_ORIGIN`, `SERVICE_NAME`, `PYTHON_SIDECAR_URL`, `DEFAULT_MODEL_ID`, `COSMOS_*`, `JWT_SECRET`, `DEFAULT_PRODUCT_ID`) [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.5** Create `src/server.ts` using `createServiceApp()` + `startService()` from `@bytelyst/fastify-core` [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.6** Add `.env.example` with all required env vars [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.7** Verify: `pnpm build` passes for the new service [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
### Python sidecar scaffold
- [x] **0.8** Create `services/extraction-service/python/` directory: [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
```
python/
  src/
    __init__.py
    app.py              # FastAPI app (internal, port 4006)
    extractor.py        # LangExtract wrapper
    task_registry.py    # Built-in task definitions
    models.py           # Pydantic models matching TS Zod schemas
  requirements.txt      # langextract, fastapi, uvicorn, pydantic
  Dockerfile            # Python 3.12 slim
```
- [x] **0.9** Create `python/requirements.txt`: [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
```
langextract>=0.3.0
fastapi>=0.115.0
uvicorn>=0.34.0
pydantic>=2.10.0
pydantic-settings>=2.7.0
structlog>=24.4.0
```
- [x] **0.10** Create `python/src/app.py` — FastAPI app with POST /extract, POST /extract/batch, GET /health [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.11** Create `python/src/extractor.py` — wrapper around `lx.extract()` with mock fallback [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [ ] **0.12** Verify: Python sidecar starts and `/health` returns OK (requires `pip install` — deferred to Phase 1)
### Package scaffold (`@bytelyst/extraction`)
- [x] **0.13** Create `packages/extraction/` directory: [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
```
packages/extraction/
  src/
    index.ts      # Public API
    types.ts      # Shared TypeScript types
    client.ts     # createExtractionClient() factory
  package.json
  tsconfig.json
```
- [x] **0.14** Create `package.json` (`@bytelyst/extraction`) with `@bytelyst/api-client` as peer dep [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.15** Define TypeScript types (ExtractionTask, ExtractionExample, ExtractionEntity, ExtractRequest, ExtractResponse, BatchExtractRequest, BatchExtractResponse) [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.16** Create `createExtractionClient()` factory using `createApiClient()` pattern [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.17** Verify: `pnpm build` passes for the new package [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
### Workspace wiring
- [x] **0.18** Verify `extraction-service` and `extraction` covered by `packages/*` + `services/*` globs in `pnpm-workspace.yaml` [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.19** Run `pnpm install` from repo root — workspace resolution verified [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.20** Verify: `pnpm build` passes for both extraction-service and @bytelyst/extraction [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
---
## Phase 1 — Core Extraction API
> **Goal:** Working extraction endpoint that accepts text + task definition and returns structured results via LangExtract.
### Python extractor implementation
- [ ] **1.1** Implement `extractor.py`:
  - Accept task definition (prompt, examples, model config)
  - Accept input text (string or URL)
  - Call `lx.extract()` with configurable parameters (model_id, extraction_passes, max_workers, max_char_buffer)
  - Return structured results with source grounding (extraction_class, extraction_text, attributes, char offsets)
  - Handle errors gracefully (model timeout, rate limit, invalid input)
- [ ] **1.2** Implement model provider configuration:
  - Gemini (default): API key from env
  - Azure OpenAI: endpoint + key from env
  - Ollama (local dev): configurable base URL
- [ ] **1.3** Add request/response logging via `structlog` (never `print()`)
- [ ] **1.4** Add request timeout configuration (default 120s for long documents)
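
The wrapper described in 1.1 could be sketched roughly as below. This is only an illustration under assumptions: `ExtractionOptions` and `run_extraction` are hypothetical names, the extractor is injected as `extract_fn` so the sketch runs without LangExtract installed (in the sidecar it would presumably be `lx.extract`), and the keyword names `text_or_documents`, `prompt_description`, and `examples` are assumed to match the installed LangExtract version and should be verified against it.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ExtractionOptions:
    """Hypothetical option bundle; defaults mirror the env var table where one exists."""
    model_id: str = "gemini-2.5-flash"
    extraction_passes: int = 1   # assumed default, not stated in this roadmap
    max_workers: int = 10
    max_char_buffer: int = 2000


def run_extraction(
    text: str,
    prompt: str,
    examples: list[Any],
    options: ExtractionOptions,
    extract_fn: Callable[..., Any],
) -> dict[str, Any]:
    """Call the injected extractor and convert failures into an error payload."""
    if not text.strip():
        return {"ok": False, "error": "invalid_input", "detail": "input text is empty"}
    try:
        result = extract_fn(
            text_or_documents=text,
            prompt_description=prompt,
            examples=examples,
            model_id=options.model_id,
            extraction_passes=options.extraction_passes,
            max_workers=options.max_workers,
            max_char_buffer=options.max_char_buffer,
        )
    except TimeoutError as exc:
        return {"ok": False, "error": "model_timeout", "detail": str(exc)}
    except Exception as exc:  # sketch only: real code would catch narrower types
        return {"ok": False, "error": "extraction_failed", "detail": str(exc)}
    return {"ok": True, "result": result}
```

Injecting `extract_fn` also lines up with test 1.11, where `lx.extract` is mocked.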
### Fastify routes
- [ ] **1.5** Implement `src/modules/extract/types.ts`:
  - `ExtractRequestSchema` (Zod) — task definition + input text + options
  - `ExtractResponseSchema` (Zod) — array of extractions + metadata
  - `BatchExtractRequestSchema` — array of inputs + shared task
- [ ] **1.6** Implement `src/modules/extract/routes.ts`:
  - `POST /api/extract` — auth required, validates input, proxies to Python sidecar
  - `POST /api/extract/batch` — auth required, accepts multiple inputs
  - `GET /api/extract/models` — list available model providers
- [ ] **1.7** Implement `src/lib/python-bridge.ts`:
  - HTTP client to Python sidecar (fetch with timeout, retry, error mapping)
  - Health check polling on startup (wait for sidecar readiness)
  - Request ID forwarding (`x-request-id`)
- [ ] **1.8** Add rate limiting to extraction endpoints (configurable per-user limit)
### Tests
- [ ] **1.9** Write unit tests for Zod schemas (`types.test.ts`)
- [ ] **1.10** Write integration tests for extract routes (mock Python sidecar responses)
- [ ] **1.11** Write Python unit tests for `extractor.py` (mock `lx.extract`)
- [ ] **1.12** Verify: `pnpm test` passes, `pytest` passes
---
## Phase 2 — Predefined Task Library
> **Goal:** Ship a curated set of extraction task definitions that LysnrAI and MindLyst can use out-of-the-box.
### Task definitions
- [ ] **2.1** Define `transcript-extraction` task:
  - Classes: `action_item`, `decision`, `question`, `deadline`, `person`, `topic`
  - 3-5 few-shot examples from realistic meeting transcripts
  - Default model: `gemini-2.5-flash`
- [ ] **2.2** Define `triage` task (MindLyst):
  - Classes: `topic`, `entity`, `action`, `emotion`, `date_reference`, `brain_signal`
  - brain_signal attributes: `{ brain: "work|home|money|health|global", confidence: float }`
  - 3-5 few-shot examples per brain type
- [ ] **2.3** Define `memory-insight` task (MindLyst):
  - Classes: `pattern`, `recurring_theme`, `relationship`, `milestone`
  - Examples from accumulated brain memories
- [ ] **2.4** Define `reflection-enrichment` task (MindLyst):
  - Classes: `emotional_state`, `accomplishment`, `concern`, `goal_progress`
  - Examples from journal-style text
- [ ] **2.5** Define `bug-report-extraction` task (Tracker):
  - Classes: `steps_to_reproduce`, `expected_behavior`, `actual_behavior`, `affected_component`, `severity`
  - Examples from real issue submissions
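
One way the `brain_signal` attributes from 2.2 might drive routing is to pick the most confident signal. The sketch below assumes each extraction is a dict with `extraction_class` and `attributes` keys (the field names listed in 1.1); the `route_brain` name, the 0.5 threshold, and the fallback to `global` are illustrative assumptions rather than decided behaviour.

```python
from typing import Any

VALID_BRAINS = {"work", "home", "money", "health", "global"}


def route_brain(extractions: list[dict[str, Any]], min_confidence: float = 0.5) -> str:
    """Return the brain named by the most confident brain_signal, else "global"."""
    best_brain, best_conf = "global", 0.0
    for item in extractions:
        if item.get("extraction_class") != "brain_signal":
            continue  # only brain_signal extractions carry routing hints
        attrs = item.get("attributes") or {}
        brain = attrs.get("brain")
        conf = float(attrs.get("confidence", 0.0))
        if brain in VALID_BRAINS and conf > best_conf:
            best_brain, best_conf = brain, conf
    # Low-confidence signals are ignored (assumed policy).
    return best_brain if best_conf >= min_confidence else "global"
```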
### Task registry (Cosmos DB)
- [ ] **2.6** Create Cosmos container: `extraction_tasks` (partition key: `/productId`)
- [ ] **2.7** Implement `src/modules/tasks/repository.ts` — CRUD for task definitions
- [ ] **2.8** Implement `src/modules/tasks/routes.ts`:
  - `GET /api/tasks` — list all tasks (built-in + custom)
  - `GET /api/tasks/:id` — get task by ID
  - `POST /api/tasks` — create custom task (admin only)
  - `PUT /api/tasks/:id` — update task (admin only)
  - `DELETE /api/tasks/:id` — delete custom task (admin only)
- [ ] **2.9** Seed built-in tasks on service startup (idempotent upsert)
- [ ] **2.10** Add `productId` to all task documents
### Python task registry
- [ ] **2.11** Implement `task_registry.py` — load task definitions from Cosmos (via Fastify API) or local JSON fallback
- [ ] **2.12** Create `python/tasks/` directory with JSON files for each built-in task
- [ ] **2.13** Add task validation: verify examples follow LangExtract best practices (ordered, verbatim, no overlap)
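
The validation in 2.13 (ordered, verbatim, no overlap) can be approximated with a single left-to-right scan. This is a minimal sketch under assumptions: `validate_example` is a hypothetical helper, and it takes the example text plus the list of `extraction_text` strings in the order they are declared.

```python
def validate_example(text: str, extraction_texts: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the example passes."""
    problems: list[str] = []
    cursor = 0  # end offset of the previously matched extraction
    for i, span in enumerate(extraction_texts):
        if not span:
            problems.append(f"extraction {i}: empty text")
            continue
        # Searching from the cursor enforces both ordering and non-overlap.
        pos = text.find(span, cursor)
        if pos == -1:
            if span in text:
                problems.append(f"extraction {i}: out of order or overlapping: {span!r}")
            else:
                problems.append(f"extraction {i}: not verbatim: {span!r}")
            continue
        cursor = pos + len(span)
    return problems
```

For example, `validate_example("Alice will send the report by Friday.", ["Friday", "Alice"])` reports the second extraction as out of order, because "Alice" does not occur after "Friday".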
### Tests
- [ ] **2.14** Write tests for task CRUD routes
- [ ] **2.15** Write tests for task seeding logic
- [ ] **2.16** Verify: all tests pass
---
## Phase 3 — Consumer Integration
> **Goal:** Wire LysnrAI and MindLyst to call the extraction service.
### `@bytelyst/extraction` package finalization
- [ ] **3.1** Add typed methods to `createExtractionClient()`:
  - `extract(input, taskId, options?)` — single extraction
  - `extractBatch(inputs, taskId, options?)` — batch extraction
  - `listTasks()` — get available tasks
  - `getTask(id)` — get task details
- [ ] **3.2** Export all types from `src/index.ts`
- [ ] **3.3** Publish: `pnpm build` in `packages/extraction/`
### LysnrAI integration
- [ ] **3.4** Add `@bytelyst/extraction` to `admin-dashboard-web/package.json` (via `file:` ref)
- [ ] **3.5** Create `admin-dashboard-web/src/lib/extraction-client.ts` — typed client instance
- [ ] **3.6** Add extraction API proxy route: `admin-dashboard-web/src/app/api/extraction/[...path]/route.ts`
- [ ] **3.7** Create Python extraction client in `backend/src/clients/extraction_client.py`:
  - HTTP client to extraction-service (port 4005)
  - Methods: `extract_transcript(text)`, `extract_batch(texts)`
- [ ] **3.8** Add post-transcription extraction to LysnrAI backend:
  - New endpoint: `POST /api/transcripts/{id}/extract`
  - Calls extraction-service with `transcript-extraction` task
  - Stores results alongside transcript
- [ ] **3.9** Add extraction results display to admin dashboard (transcript detail page)
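
The Python client from 3.7 might look roughly like the standard-library sketch below. The endpoint paths and method names come from this roadmap; everything else is an assumption made for illustration: the JSON field names (`taskId`, `text`, `texts`), the bearer-token header, the 120 s timeout (borrowed from 1.4), and the injectable `send` transport, which exists only so the client can be exercised without a running service.

```python
import json
import urllib.request
from typing import Any, Callable, Optional

Sender = Callable[[urllib.request.Request, float], bytes]


def _http_send(request: urllib.request.Request, timeout: float) -> bytes:
    """Default transport: perform the request with urllib and return the body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


class ExtractionClient:
    """Hypothetical client; payload field names are assumptions, not the real schema."""

    def __init__(self, base_url: str = "http://localhost:4005",
                 token: Optional[str] = None, timeout: float = 120.0,
                 send: Sender = _http_send) -> None:
        self._base_url = base_url.rstrip("/")
        self._token = token
        self._timeout = timeout
        self._send = send

    def _post(self, path: str, payload: dict[str, Any]) -> Any:
        headers = {"Content-Type": "application/json"}
        if self._token:
            headers["Authorization"] = f"Bearer {self._token}"
        request = urllib.request.Request(
            self._base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers=headers,
            method="POST",
        )
        return json.loads(self._send(request, self._timeout))

    def extract_transcript(self, text: str) -> Any:
        """Single extraction using the transcript-extraction task."""
        return self._post("/api/extract",
                          {"taskId": "transcript-extraction", "text": text})

    def extract_batch(self, texts: list[str]) -> Any:
        """Batch extraction sharing one task across all inputs."""
        return self._post("/api/extract/batch",
                          {"taskId": "transcript-extraction", "texts": texts})
```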
### MindLyst integration
- [ ] **3.10** Add `@bytelyst/extraction` to `mindlyst-native/web/package.json` (via `file:` ref):
```
"@bytelyst/extraction": "file:../../../learning_ai_common_plat/packages/extraction"
```
- [ ] **3.11** Create `mindlyst-native/web/src/lib/extraction-client.ts`
- [ ] **3.12** Create API route: `mindlyst-native/web/src/pages/api/extract.ts`
  - Accepts raw capture text, calls extraction-service with `triage` task
  - Returns brain routing + extracted entities
- [ ] **3.13** Update triage flow on web dashboard to use extraction results for brain auto-routing
- [ ] **3.14** Wire brain insight generation to use `memory-insight` task
- [ ] **3.15** Wire reflection enrichment to use `reflection-enrichment` task
### Tests
- [ ] **3.16** Add integration tests for LysnrAI extraction endpoint
- [ ] **3.17** Add integration tests for MindLyst triage-via-extraction flow
- [ ] **3.18** Verify: `npx tsc --noEmit` passes in all 3 dashboards + MindLyst web
---
## Phase 4 — Docker & DevOps
> **Goal:** Containerize, add to docker-compose, update run scripts.
### Dockerfile
- [ ] **4.1** Create multi-stage `Dockerfile` for extraction-service:
  - Stage 1: Node.js build (Fastify TS → JS)
  - Stage 2: Python setup (install langextract + deps)
  - Stage 3: Runtime (Node.js + Python, supervisord to run both processes)
- [ ] **4.2** Create `supervisord.conf` to manage Fastify (port 4005) + Python sidecar (port 4006)
- [ ] **4.3** Verify: `docker build` succeeds
### Docker Compose
- [ ] **4.4** Add `extraction-service` to `docker-compose.yml`:
```yaml
extraction-service:
  build:
    context: .
    dockerfile: services/extraction-service/Dockerfile
  ports:
    - '4005:4005'
  env_file:
    - .env
  environment:
    - PORT=4005
    - PYTHON_SIDECAR_URL=http://localhost:4006
  labels:
    - 'traefik.enable=true'
    - 'traefik.http.routers.extraction.rule=PathPrefix(`/api/extract`) || PathPrefix(`/api/tasks`)'
    - 'traefik.http.services.extraction.loadbalancer.server.port=4005'
  logging:
    driver: loki
    options:
      loki-url: 'http://host.docker.internal:3100/loki/api/v1/push'
      loki-retries: '3'
  restart: unless-stopped
  healthcheck:
    test: ['CMD', 'wget', '--no-verbose', '--tries=1', '--spider', 'http://localhost:4005/health']
    interval: 30s
    timeout: 10s
    retries: 3
```
- [ ] **4.5** Add to LysnrAI `docker-compose.yml` (references `../learning_ai_common_plat/services/extraction-service/`)
### Run scripts
- [ ] **4.6** Add extraction-service to `run-local-all-services.sh` in LysnrAI repo
- [ ] **4.7** Add extraction-service to `.windsurf/workflows/start-all-services.md`
- [ ] **4.8** Add `.env.example` entries to LysnrAI repo root (`EXTRACTION_SERVICE_URL=http://localhost:4005`)
- [ ] **4.9** Add `.env.example` entries to MindLyst web (same)
### CI
- [ ] **4.10** Create `.github/workflows/ci-extraction-service.yml`:
  - Trigger: push to `services/extraction-service/**` or `packages/extraction/**`
  - Steps: pnpm install, pnpm build, pnpm test (TS), pip install + pytest (Python)
- [ ] **4.11** Verify: CI workflow passes
---
## Phase 5 — Production Hardening
> **Goal:** Rate limiting, caching, observability, cost controls.
### Caching
- [ ] **5.1** Add result caching in Python sidecar:
  - Cache key: hash(task_id + input_text + model_id)
  - TTL: configurable (default 24h)
  - Storage: in-memory LRU (dev) or Redis (prod)
- [ ] **5.2** Add cache hit/miss headers to Fastify response (`X-Extraction-Cache: HIT/MISS`)
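
The cache in 5.1 could be sketched as follows for the in-memory (dev) case. The names `cache_key` and `TTLCache`, the SHA-256 choice, the length-prefixing of fields, and the 1024-entry cap are all assumptions; the 86400 s default matches `EXTRACTION_CACHE_TTL`. The clock is injectable so expiry can be tested without sleeping.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


def cache_key(task_id: str, input_text: str, model_id: str) -> str:
    """SHA-256 over the three fields, length-prefixed so field boundaries stay unambiguous."""
    digest = hashlib.sha256()
    for part in (task_id, input_text, model_id):
        data = part.encode("utf-8")
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


class TTLCache:
    """In-memory LRU cache with per-entry expiry."""

    def __init__(self, max_entries: int = 1024, ttl_seconds: float = 86400,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: OrderedDict = OrderedDict()  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]        # expired: treat as a miss
            return None
        self._entries.move_to_end(key)    # mark as most recently used
        return value

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```

Length-prefixing avoids collisions that plain concatenation would allow, such as `("t", "ab")` versus `("ta", "b")`.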
### Cost controls
- [ ] **5.3** Add per-user daily extraction quota (configurable per plan tier):
  - Free: 10 extractions/day
  - Pro: 100 extractions/day
  - Enterprise: unlimited
- [ ] **5.4** Track usage in Cosmos `extraction_usage` container (partition: `/userId`)
- [ ] **5.5** Return `429 Too Many Requests` with quota info when exceeded
- [ ] **5.6** Add usage reporting endpoint: `GET /api/extract/usage` (admin)
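
The quota check behind 5.3 and 5.5 reduces to a small lookup. In this sketch the tier limits are taken from the list above, while the `check_quota` name, the lowercase tier keys, and the return shape are assumptions; a caller would translate `allowed == False` into the `429` response.

```python
from typing import Optional

# Daily limits from 5.3; None means unlimited.
DAILY_LIMITS: dict[str, Optional[int]] = {"free": 10, "pro": 100, "enterprise": None}


def check_quota(tier: str, used_today: int) -> tuple[bool, Optional[int]]:
    """Return (allowed, remaining after this request); remaining is None when unlimited."""
    if tier not in DAILY_LIMITS:
        raise ValueError(f"unknown plan tier: {tier!r}")
    limit = DAILY_LIMITS[tier]
    if limit is None:
        return True, None
    if used_today >= limit:
        return False, 0
    return True, limit - used_today - 1
```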
### Observability
- [ ] **5.7** Add structured logging for every extraction:
  - Request: task_id, input_length, model_id, user_id, product_id
  - Response: entity_count, duration_ms, token_count, cache_hit
- [ ] **5.8** Add Prometheus metrics (via `fastify-metrics`):
  - `extraction_requests_total` (labels: task_id, model_id, product_id, status)
  - `extraction_duration_seconds` (histogram)
  - `extraction_entities_extracted` (histogram)
  - `extraction_cache_hit_total`
- [ ] **5.9** Add Grafana dashboard for extraction service (in `services/monitoring/grafana/dashboards/`)
### Error handling
- [ ] **5.10** Map LangExtract errors to `@bytelyst/errors`:
  - Model timeout → `408 Request Timeout`
  - Rate limit (upstream LLM) → `429 Too Many Requests` with retry-after
  - Invalid task definition → `400 Bad Request`
  - Model unavailable → `503 Service Unavailable`
- [ ] **5.11** Add circuit breaker for Python sidecar (fail fast if sidecar is down)
- [ ] **5.12** Add graceful degradation: return partial results if some chunks fail
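
The status mapping in 5.10 and the circuit breaker in 5.11 are shown below as a language-neutral sketch in Python, although the production versions would presumably live in the Fastify layer. The status codes come from the list above; the error identifier strings, the `CircuitBreaker` class, and its threshold and cooldown defaults are assumptions.

```python
import time
from typing import Callable, Optional

# Status codes from 5.10; the string keys are hypothetical error identifiers.
STATUS_BY_ERROR = {
    "model_timeout": 408,
    "upstream_rate_limit": 429,
    "invalid_task": 400,
    "model_unavailable": 503,
}


class CircuitBreaker:
    """Open after N consecutive failures; allow a trial call once the cooldown passes."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Half-open: permit a trial request after the cooldown has elapsed.
        return self._clock() - self._opened_at >= self._cooldown

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()  # (re)open and restart the cooldown
```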
---
## Phase 6 — Advanced Features (Future)
> **Goal:** Power-user features, visualization, and batch processing.
### Visualization
- [ ] **6.1** Expose LangExtract's HTML visualization:
  - `GET /api/extract/:requestId/visualization` — returns interactive HTML
  - Embed in admin dashboard for extraction quality review
- [ ] **6.2** Store visualization artifacts in Azure Blob Storage (`extractions` container)
### Batch & async processing
- [ ] **6.3** Add async extraction endpoint:
  - `POST /api/extract/async` — returns job ID immediately
  - `GET /api/extract/jobs/:id` — poll for status + results
  - Webhook callback when complete
- [ ] **6.4** Add Vertex AI batch processing support (for high-volume MindLyst triage)
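
The submit-then-poll flow in 6.3 can be illustrated with an in-memory job table. This is a sketch under assumptions: `JobStore`, its status strings, and the synchronous `run` method are hypothetical, and a real service would persist jobs and execute them on a background worker rather than inline.

```python
import uuid
from typing import Any, Callable


class JobStore:
    """In-memory job table for the async extraction flow."""

    def __init__(self) -> None:
        self._jobs: dict[str, dict[str, Any]] = {}

    def submit(self, payload: dict[str, Any]) -> str:
        """Roughly what POST /api/extract/async would do: record the job, return its ID."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = {"status": "pending", "payload": payload,
                              "result": None, "error": None}
        return job_id

    def run(self, job_id: str, worker: Callable[[dict[str, Any]], Any]) -> None:
        """Execute one job with the given worker and store the outcome."""
        job = self._jobs[job_id]
        job["status"] = "running"
        try:
            job["result"] = worker(job["payload"])
            job["status"] = "succeeded"
        except Exception as exc:  # sketch: record the failure for the poller
            job["error"] = str(exc)
            job["status"] = "failed"

    def get(self, job_id: str) -> dict[str, Any]:
        """Roughly what GET /api/extract/jobs/:id would return."""
        job = self._jobs[job_id]
        return {"id": job_id, "status": job["status"],
                "result": job["result"], "error": job["error"]}
```

The webhook callback would hook in at the end of `run`, once the status is final.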
### Custom model support
- [ ] **6.5** Add Ollama provider for local/air-gapped deployments
- [ ] **6.6** Add model benchmarking endpoint: run same task across models, compare quality + cost
### Multi-language extraction
- [ ] **6.7** Test and validate extraction across languages (LangExtract supports multi-language via LLM)
- [ ] **6.8** Add language detection to extraction pipeline (auto-detect input language)
---
## Env Vars Summary
| Variable | Service | Default | Description |
| ------------------------ | ------------------ | ----------------------- | ----------------------------------- |
| `PORT` | extraction-service | `4005` | Fastify listen port |
| `HOST` | extraction-service | `0.0.0.0` | Fastify listen host |
| `CORS_ORIGIN` | extraction-service | `*` | Allowed origins |
| `PYTHON_SIDECAR_URL` | extraction-service | `http://localhost:4006` | Python sidecar URL |
| `DEFAULT_MODEL_ID` | extraction-service | `gemini-2.5-flash` | Default LLM model |
| `GEMINI_API_KEY` | python sidecar | — | Google Gemini API key |
| `AZURE_OPENAI_API_KEY` | python sidecar | — | Azure OpenAI key (alternative) |
| `AZURE_OPENAI_ENDPOINT` | python sidecar | — | Azure OpenAI endpoint (alternative) |
| `MAX_WORKERS` | python sidecar | `10` | Parallel extraction workers |
| `MAX_CHAR_BUFFER` | python sidecar | `2000` | Chunk size for long docs |
| `EXTRACTION_CACHE_TTL` | python sidecar | `86400` | Cache TTL in seconds |
| `COSMOS_ENDPOINT` | extraction-service | — | Azure Cosmos DB endpoint |
| `COSMOS_KEY` | extraction-service | — | Azure Cosmos DB key |
| `COSMOS_DATABASE` | extraction-service | `lysnrai` | Database name |
| `JWT_SECRET` | extraction-service | — | JWT validation secret |
| `EXTRACTION_SERVICE_URL` | consumers | `http://localhost:4005` | Used by dashboards/backends |
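
For the sidecar rows of the table, the defaults can be read as in the sketch below. It uses only the standard library to stay dependency-free; since `requirements.txt` lists `pydantic-settings`, the real sidecar would more likely declare these through that package. `SidecarSettings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SidecarSettings:
    gemini_api_key: Optional[str]
    max_workers: int
    max_char_buffer: int
    extraction_cache_ttl: int  # seconds


def load_settings(env: Mapping[str, str] = os.environ) -> SidecarSettings:
    """Read sidecar settings, falling back to the defaults in the table above."""
    return SidecarSettings(
        gemini_api_key=env.get("GEMINI_API_KEY"),
        max_workers=int(env.get("MAX_WORKERS", "10")),
        max_char_buffer=int(env.get("MAX_CHAR_BUFFER", "2000")),
        extraction_cache_ttl=int(env.get("EXTRACTION_CACHE_TTL", "86400")),
    )
```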
---
## Port Allocation
| Service | Port |
| -------------------------------------------- | -------- |
| growth-service | 4001 |
| billing-service | 4002 |
| platform-service | 4003 |
| tracker-service | 4004 |
| **extraction-service** | **4005** |
| extraction-service python sidecar (internal) | 4006 |
---
## Dependency Graph
```
@bytelyst/extraction (package)
└── @bytelyst/api-client (peer dep)
@lysnrai/extraction-service (service)
├── @bytelyst/fastify-core
├── @bytelyst/auth
├── @bytelyst/config
├── @bytelyst/cosmos
├── @bytelyst/errors
├── fastify, zod, jose (direct deps)
└── python sidecar
└── langextract, fastapi, uvicorn, structlog
```
---
## Estimated Effort
| Phase | Effort | Dependencies |
| ------------------------------ | -------- | ------------ |
| Phase 0 — Foundation           | 2-3 days | None         |
| Phase 1 — Core API             | 2-3 days | Phase 0      |
| Phase 2 — Task Library         | 2 days   | Phase 1      |
| Phase 3 — Consumer Integration | 3-4 days | Phase 2      |
| Phase 4 — Docker & DevOps      | 1-2 days | Phase 1      |
| Phase 5 — Production Hardening | 2-3 days | Phase 3      |
| Phase 6 — Advanced (future)    | Ongoing  | Phase 5      |

**Total MVP (Phases 0-4): ~10-14 days**

---
## Rollback Strategy
- The extraction-service is **additive** — no existing code is modified until Phase 3
- Phase 3 consumer integration uses new endpoints/routes — existing triage/transcript flows remain untouched
- If extraction-service is down, consumers fall back to their existing behavior (MindLyst mock triage, LysnrAI raw transcripts)
- The `@bytelyst/extraction` package is optional — dashboards only import it for new extraction features
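
The fallback described above (consumers reverting to existing behaviour when the service is down) might be wired as in this sketch. `extract_or_fallback` and both callables are hypothetical; catching `OSError` is an assumption that covers connection failures and timeouts raised by the standard library, and other transports may raise different exception types.

```python
from typing import Any, Callable


def extract_or_fallback(
    text: str,
    extract: Callable[[str], Any],
    fallback: Callable[[str], Any],
) -> tuple[Any, bool]:
    """Return (result, used_extraction_service)."""
    try:
        return extract(text), True
    except OSError:
        # ConnectionError, TimeoutError, and urllib's URLError all derive from OSError.
        return fallback(text), False
```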