diff --git a/docs/EXTRACTION_SERVICE_ROADMAP.md b/docs/EXTRACTION_SERVICE_ROADMAP.md
new file mode 100644
index 00000000..ad47e693
--- /dev/null
+++ b/docs/EXTRACTION_SERVICE_ROADMAP.md
@@ -0,0 +1,511 @@
+# Extraction Service — Roadmap & Task Checklist
+
+> **Service:** `@lysnrai/extraction-service` (port 4005)
+> **Package:** `@bytelyst/extraction` (shared types + client)
+> **Core dependency:** [google/langextract](https://github.com/google/langextract) (Python)
+>
+> **Companion docs:** [ECOSYSTEM_ARCHITECTURE.md](./ECOSYSTEM_ARCHITECTURE.md) · [ROADMAP.md](./ROADMAP.md)
+
+---
+
+## Overview
+
+A shared extraction microservice that uses Google's LangExtract library to extract structured information from unstructured text. Both LysnrAI and MindLyst consume this service for their respective extraction needs.
+
+**Architecture:** Fastify (routing, auth, validation, request tracing) + Python sidecar (LangExtract). The Fastify layer keeps the service consistent with the other 4 services. The Python process handles the actual LLM-powered extraction.
+
+```
+┌──────────────────────────────────────────────────────────┐
+│                    extraction-service                    │
+│                       (port 4005)                        │
+│                                                          │
+│  ┌─────────────────────┐   ┌──────────────────────────┐  │
+│  │  Fastify (TS)       │   │  Python Sidecar          │  │
+│  │                     │   │                          │  │
+│  │  - Auth middleware  │──►│  - LangExtract wrapper   │  │
+│  │  - Zod validation   │◄──│  - Task registry         │  │
+│  │  - x-request-id     │   │  - Model provider config │  │
+│  │  - Rate limiting    │   │  - Result caching        │  │
+│  │  - /health          │   │                          │  │
+│  └─────────────────────┘   └──────────────────────────┘  │
+└──────────────────────────────────────────────────────────┘
+             ▲                            ▲
+             │                            │
+          REST API            FastAPI (internal :4006)
+         (external)              or subprocess stdio
+```
+
+### Consumers
+
+| Product | Use Case | Entry Point |
+| ----------------------------- | ------------------------------------------------------------------------ | -------------------------------------------------- |
+| **LysnrAI** — Desktop/Backend | Post-transcription extraction (action items, decisions, dates, people) | `backend/src/clients/extraction_client.py` |
+| **LysnrAI** — Admin Dashboard | Transcript analytics, entity review | `admin-dashboard-web/src/lib/extraction-client.ts` |
+| **MindLyst** — KMP/Web | Triage pipeline (brain routing, entity extraction, topic classification) | `mindlyst-native/web/src/pages/api/triage.ts` |
+| **MindLyst** — Web Dashboard | Brain insight generation, reflection enrichment | Direct API calls via `@bytelyst/api-client` |
+
+---
+
+## Phase 0 — Foundation & Scaffolding
+
+> **Goal:** Set up the service skeleton, Python environment, and build pipeline.
+
+### Service scaffold (Fastify)
+
+- [ ] **0.1** Create `services/extraction-service/` directory structure:
+  ```
+  services/extraction-service/
+    src/
+      lib/
+        config.ts           # Zod config schema (PORT, HOST, CORS, PYTHON_SIDECAR_URL, etc.)
+        errors.ts           # Re-export from @bytelyst/errors
+        cosmos.ts           # Re-export from @bytelyst/cosmos (for task registry persistence)
+        product-config.ts   # Re-export from @bytelyst/config
+        python-bridge.ts    # HTTP client to Python sidecar
+      modules/
+        extract/
+          types.ts          # Zod schemas: ExtractionTask, ExtractionExample, ExtractionResult
+          routes.ts         # POST /api/extract, POST /api/extract/batch, GET /api/tasks
+        tasks/
+          types.ts          # Predefined task definitions (triage, transcript, etc.)
+          repository.ts     # Cosmos CRUD for custom task definitions
+          routes.ts         # CRUD endpoints for task management
+      server.ts             # createServiceApp + route registration
+    package.json
+    tsconfig.json
+    Dockerfile
+  ```
+- [ ] **0.2** Create `package.json` (`@lysnrai/extraction-service`, port 4005) matching existing service conventions
+- [ ] **0.3** Create `tsconfig.json` extending `../../tsconfig.base.json`
+- [ ] **0.4** Create `src/lib/config.ts` with Zod schema:
+  - `PORT` (default 4005), `HOST`, `CORS_ORIGIN`
+  - `PYTHON_SIDECAR_URL` (default `http://localhost:4006`)
+  - `DEFAULT_MODEL_ID` (default `gemini-2.5-flash`)
+  - `GEMINI_API_KEY` or `AZURE_OPENAI_API_KEY` / `AZURE_OPENAI_ENDPOINT`
+  - `MAX_WORKERS` (default 10), `MAX_CHAR_BUFFER` (default 2000)
+  - `COSMOS_ENDPOINT`, `COSMOS_KEY`, `COSMOS_DATABASE`, `JWT_SECRET`
+- [ ] **0.5** Create `src/server.ts` using `createServiceApp()` + `startService()` from `@bytelyst/fastify-core`
+- [ ] **0.6** Add `.env.example` with all required env vars
+- [ ] **0.7** Verify: `pnpm build` passes for the new service
+
+### Python sidecar scaffold
+
+- [ ] **0.8** Create `services/extraction-service/python/` directory:
+  ```
+  python/
+    src/
+      __init__.py
+      app.py              # FastAPI app (internal, port 4006)
+      extractor.py        # LangExtract wrapper
+      task_registry.py    # Built-in task definitions
+      models.py           # Pydantic models matching TS Zod schemas
+    requirements.txt      # langextract, fastapi, uvicorn, pydantic
+    Dockerfile            # Python 3.12 slim
+  ```
+- [ ] **0.9** Create `python/requirements.txt`:
+  ```
+  langextract>=0.3.0
+  fastapi>=0.115.0
+  uvicorn>=0.34.0
+  pydantic>=2.10.0
+  pydantic-settings>=2.7.0
+  structlog>=24.4.0
+  ```
+- [ ] **0.10** Create `python/src/app.py` — FastAPI app with endpoints:
+  - `POST /extract` — single document extraction
+  - `POST /extract/batch` — batch extraction
+  - `GET /health` — sidecar health check
+- [ ] **0.11** Create `python/src/extractor.py` — wrapper around `lx.extract()` with configurable model provider
+- [ ] **0.12** Verify: Python sidecar starts and `/health` returns OK
+
+### Package scaffold (`@bytelyst/extraction`)
+
+- [ ] **0.13** Create `packages/extraction/` directory:
+  ```
+  packages/extraction/
+    src/
+      index.ts            # Public API
+      types.ts            # Shared TypeScript types
+      client.ts           # createExtractionClient() factory
+    package.json
+    tsconfig.json
+  ```
+- [ ] **0.14** Create `package.json` (`@bytelyst/extraction`) with `@bytelyst/api-client` as peer dep
+- [ ] **0.15** Define TypeScript types matching the extraction API:
+  - `ExtractionTask` — prompt description + examples + model config
+  - `ExtractionExample` — text + extractions (class, text, attributes)
+  - `ExtractionResult` — extracted entities with source grounding
+  - `ExtractionRequest` — task + input text/URL
+  - `ExtractionResponse` — results + metadata (model, duration, token count)
+- [ ] **0.16** Create `createExtractionClient()` factory using `createApiClient()` pattern
+- [ ] **0.17** Verify: `pnpm build` passes for the new package
+
+### Workspace wiring
+
+- [ ] **0.18** Add `extraction-service` and `extraction` to `pnpm-workspace.yaml` (already covered by `packages/*` + `services/*` globs — verify)
+- [ ] **0.19** Run `pnpm install` from repo root — verify workspace resolution
+- [ ] **0.20** Verify: `pnpm build` and `pnpm typecheck` pass across entire repo
+
+---
+
+## Phase 1 — Core Extraction API
+
+> **Goal:** Working extraction endpoint that accepts text + task definition and returns structured results via LangExtract.
+
+### Python extractor implementation
+
+- [ ] **1.1** Implement `extractor.py`:
+  - Accept task definition (prompt, examples, model config)
+  - Accept input text (string or URL)
+  - Call `lx.extract()` with configurable parameters (model_id, extraction_passes, max_workers, max_char_buffer)
+  - Return structured results with source grounding (extraction_class, extraction_text, attributes, char offsets)
+  - Handle errors gracefully (model timeout, rate limit, invalid input)
+- [ ] **1.2** Implement model provider configuration:
+  - Gemini (default): API key from env
+  - Azure OpenAI: endpoint + key from env
+  - Ollama (local dev): configurable base URL
+- [ ] **1.3** Add request/response logging via `structlog` (never `print()`)
+- [ ] **1.4** Add request timeout configuration (default 120s for long documents)
+
+### Fastify routes
+
+- [ ] **1.5** Implement `src/modules/extract/types.ts`:
+  - `ExtractRequestSchema` (Zod) — task definition + input text + options
+  - `ExtractResponseSchema` (Zod) — array of extractions + metadata
+  - `BatchExtractRequestSchema` — array of inputs + shared task
+- [ ] **1.6** Implement `src/modules/extract/routes.ts`:
+  - `POST /api/extract` — auth required, validates input, proxies to Python sidecar
+  - `POST /api/extract/batch` — auth required, accepts multiple inputs
+  - `GET /api/extract/models` — list available model providers
+- [ ] **1.7** Implement `src/lib/python-bridge.ts`:
+  - HTTP client to Python sidecar (fetch with timeout, retry, error mapping)
+  - Health check polling on startup (wait for sidecar readiness)
+  - Request ID forwarding (`x-request-id`)
+- [ ] **1.8** Add rate limiting to extraction endpoints (configurable per-user limit)
+
+### Tests
+
+- [ ] **1.9** Write unit tests for Zod schemas (`types.test.ts`)
+- [ ] **1.10** Write integration tests for extract routes (mock Python sidecar responses)
+- [ ] **1.11** Write Python unit tests for `extractor.py` (mock `lx.extract`)
+- [ ] **1.12** Verify: `pnpm test` passes, `pytest` passes
+
+---
+
+## Phase 2 — Predefined Task Library
+
+> **Goal:** Ship a curated set of extraction task definitions that LysnrAI and MindLyst can use out-of-the-box.
+
+### Task definitions
+
+- [ ] **2.1** Define `transcript-extraction` task:
+  - Classes: `action_item`, `decision`, `question`, `deadline`, `person`, `topic`
+  - 3–5 few-shot examples from realistic meeting transcripts
+  - Default model: `gemini-2.5-flash`
+- [ ] **2.2** Define `triage` task (MindLyst):
+  - Classes: `topic`, `entity`, `action`, `emotion`, `date_reference`, `brain_signal`
+  - brain_signal attributes: `{ brain: "work|home|money|health|global", confidence: float }`
+  - 3–5 few-shot examples per brain type
+- [ ] **2.3** Define `memory-insight` task (MindLyst):
+  - Classes: `pattern`, `recurring_theme`, `relationship`, `milestone`
+  - Examples from accumulated brain memories
+- [ ] **2.4** Define `reflection-enrichment` task (MindLyst):
+  - Classes: `emotional_state`, `accomplishment`, `concern`, `goal_progress`
+  - Examples from journal-style text
+- [ ] **2.5** Define `bug-report-extraction` task (Tracker):
+  - Classes: `steps_to_reproduce`, `expected_behavior`, `actual_behavior`, `affected_component`, `severity`
+  - Examples from real issue submissions
+
+### Task registry (Cosmos DB)
+
+- [ ] **2.6** Create Cosmos container: `extraction_tasks` (partition key: `/productId`)
+- [ ] **2.7** Implement `src/modules/tasks/repository.ts` — CRUD for task definitions
+- [ ] **2.8** Implement `src/modules/tasks/routes.ts`:
+  - `GET /api/tasks` — list all tasks (built-in + custom)
+  - `GET /api/tasks/:id` — get task by ID
+  - `POST /api/tasks` — create custom task (admin only)
+  - `PUT /api/tasks/:id` — update task (admin only)
+  - `DELETE /api/tasks/:id` — delete custom task (admin only)
+- [ ] **2.9** Seed built-in tasks on service startup (idempotent upsert)
+- [ ] **2.10** Add `productId` to all task documents
+
+### Python task registry
+
+- [ ] **2.11** Implement `task_registry.py` — load task definitions from Cosmos (via Fastify API) or local JSON fallback
+- [ ] **2.12** Create `python/tasks/` directory with JSON files for each built-in task
+- [ ] **2.13** Add task validation: verify examples follow LangExtract best practices (ordered, verbatim, no overlap)
+
+### Tests
+
+- [ ] **2.14** Write tests for task CRUD routes
+- [ ] **2.15** Write tests for task seeding logic
+- [ ] **2.16** Verify: all tests pass
+
+---
+
+## Phase 3 — Consumer Integration
+
+> **Goal:** Wire LysnrAI and MindLyst to call the extraction service.
+
+### `@bytelyst/extraction` package finalization
+
+- [ ] **3.1** Add typed methods to `createExtractionClient()`:
+  - `extract(input, taskId, options?)` — single extraction
+  - `extractBatch(inputs, taskId, options?)` — batch extraction
+  - `listTasks()` — get available tasks
+  - `getTask(id)` — get task details
+- [ ] **3.2** Export all types from `src/index.ts`
+- [ ] **3.3** Publish: `pnpm build` in `packages/extraction/`
+
+### LysnrAI integration
+
+- [ ] **3.4** Add `@bytelyst/extraction` to `admin-dashboard-web/package.json` (via `file:` ref)
+- [ ] **3.5** Create `admin-dashboard-web/src/lib/extraction-client.ts` — typed client instance
+- [ ] **3.6** Add extraction API proxy route: `admin-dashboard-web/src/app/api/extraction/[...path]/route.ts`
+- [ ] **3.7** Create Python extraction client in `backend/src/clients/extraction_client.py`:
+  - HTTP client to extraction-service (port 4005)
+  - Methods: `extract_transcript(text)`, `extract_batch(texts)`
+- [ ] **3.8** Add post-transcription extraction to LysnrAI backend:
+  - New endpoint: `POST /api/transcripts/{id}/extract`
+  - Calls extraction-service with `transcript-extraction` task
+  - Stores results alongside transcript
+- [ ] **3.9** Add extraction results display to admin dashboard (transcript detail page)
+
+### MindLyst integration
+
+- [ ] **3.10** Add `@bytelyst/extraction` to `mindlyst-native/web/package.json` (via `file:` ref):
+  ```
+  "@bytelyst/extraction": "file:../../../learning_ai_common_plat/packages/extraction"
+  ```
+- [ ] **3.11** Create `mindlyst-native/web/src/lib/extraction-client.ts`
+- [ ] **3.12** Create API route: `mindlyst-native/web/src/pages/api/extract.ts`
+  - Accepts raw capture text, calls extraction-service with `triage` task
+  - Returns brain routing + extracted entities
+- [ ] **3.13** Update triage flow on web dashboard to use extraction results for brain auto-routing
+- [ ] **3.14** Wire brain insight generation to use `memory-insight` task
+- [ ] **3.15** Wire reflection enrichment to use `reflection-enrichment` task
+
+### Tests
+
+- [ ] **3.16** Add integration tests for LysnrAI extraction endpoint
+- [ ] **3.17** Add integration tests for MindLyst triage-via-extraction flow
+- [ ] **3.18** Verify: `npx tsc --noEmit` passes in all 3 dashboards + MindLyst web
+
+---
+
+## Phase 4 — Docker & DevOps
+
+> **Goal:** Containerize, add to docker-compose, update run scripts.
+
+### Dockerfile
+
+- [ ] **4.1** Create multi-stage `Dockerfile` for extraction-service:
+  - Stage 1: Node.js build (Fastify TS → JS)
+  - Stage 2: Python setup (install langextract + deps)
+  - Stage 3: Runtime (Node.js + Python, supervisord to run both processes)
+- [ ] **4.2** Create `supervisord.conf` to manage Fastify (port 4005) + Python sidecar (port 4006)
+- [ ] **4.3** Verify: `docker build` succeeds
+
+### Docker Compose
+
+- [ ] **4.4** Add `extraction-service` to `docker-compose.yml`:
+  ```yaml
+  extraction-service:
+    build:
+      context: .
+      dockerfile: services/extraction-service/Dockerfile
+    ports:
+      - '4005:4005'
+    env_file:
+      - .env
+    environment:
+      - PORT=4005
+      - PYTHON_SIDECAR_URL=http://localhost:4006
+    labels:
+      - 'traefik.enable=true'
+      - 'traefik.http.routers.extraction.rule=PathPrefix(`/api/extract`) || PathPrefix(`/api/tasks`)'
+      - 'traefik.http.services.extraction.loadbalancer.server.port=4005'
+    logging:
+      driver: loki
+      options:
+        loki-url: 'http://host.docker.internal:3100/loki/api/v1/push'
+        loki-retries: '3'
+    restart: unless-stopped
+    healthcheck:
+      test: ['CMD', 'wget', '--no-verbose', '--tries=1', '--spider', 'http://localhost:4005/health']
+      interval: 30s
+      timeout: 10s
+      retries: 3
+  ```
+- [ ] **4.5** Add to LysnrAI `docker-compose.yml` (references `../learning_ai_common_plat/services/extraction-service/`)
+
+### Run scripts
+
+- [ ] **4.6** Add extraction-service to `run-local-all-services.sh` in LysnrAI repo
+- [ ] **4.7** Add extraction-service to `.windsurf/workflows/start-all-services.md`
+- [ ] **4.8** Add `.env.example` entries to LysnrAI repo root (`EXTRACTION_SERVICE_URL=http://localhost:4005`)
+- [ ] **4.9** Add `.env.example` entries to MindLyst web (same)
+
+### CI
+
+- [ ] **4.10** Create `.github/workflows/ci-extraction-service.yml`:
+  - Trigger: push to `services/extraction-service/**` or `packages/extraction/**`
+  - Steps: pnpm install, pnpm build, pnpm test (TS), pip install + pytest (Python)
+- [ ] **4.11** Verify: CI workflow passes
+
+---
+
+## Phase 5 — Production Hardening
+
+> **Goal:** Rate limiting, caching, observability, cost controls.
+
+### Caching
+
+- [ ] **5.1** Add result caching in Python sidecar:
+  - Cache key: hash(task_id + input_text + model_id)
+  - TTL: configurable (default 24h)
+  - Storage: in-memory LRU (dev) or Redis (prod)
+- [ ] **5.2** Add cache hit/miss headers to Fastify response (`X-Extraction-Cache: HIT/MISS`)
+
+### Cost controls
+
+- [ ] **5.3** Add per-user daily extraction quota (configurable per plan tier):
+  - Free: 10 extractions/day
+  - Pro: 100 extractions/day
+  - Enterprise: unlimited
+- [ ] **5.4** Track usage in Cosmos `extraction_usage` container (partition: `/userId`)
+- [ ] **5.5** Return `429 Too Many Requests` with quota info when exceeded
+- [ ] **5.6** Add usage reporting endpoint: `GET /api/extract/usage` (admin)
+
+### Observability
+
+- [ ] **5.7** Add structured logging for every extraction:
+  - Request: task_id, input_length, model_id, user_id, product_id
+  - Response: entity_count, duration_ms, token_count, cache_hit
+- [ ] **5.8** Add Prometheus metrics (via `fastify-metrics`):
+  - `extraction_requests_total` (labels: task_id, model_id, product_id, status)
+  - `extraction_duration_seconds` (histogram)
+  - `extraction_entities_extracted` (histogram)
+  - `extraction_cache_hit_total`
+- [ ] **5.9** Add Grafana dashboard for extraction service (in `services/monitoring/grafana/dashboards/`)
+
+### Error handling
+
+- [ ] **5.10** Map LangExtract errors to `@bytelyst/errors`:
+  - Model timeout → `408 Request Timeout`
+  - Rate limit (upstream LLM) → `429 Too Many Requests` with retry-after
+  - Invalid task definition → `400 Bad Request`
+  - Model unavailable → `503 Service Unavailable`
+- [ ] **5.11** Add circuit breaker for Python sidecar (fail fast if sidecar is down)
+- [ ] **5.12** Add graceful degradation: return partial results if some chunks fail
+
+---
+
+## Phase 6 — Advanced Features (Future)
+
+> **Goal:** Power-user features, visualization, and batch processing.
+
+### Visualization
+
+- [ ] **6.1** Expose LangExtract's HTML visualization:
+  - `GET /api/extract/:requestId/visualization` — returns interactive HTML
+  - Embed in admin dashboard for extraction quality review
+- [ ] **6.2** Store visualization artifacts in Azure Blob Storage (`extractions` container)
+
+### Batch & async processing
+
+- [ ] **6.3** Add async extraction endpoint:
+  - `POST /api/extract/async` — returns job ID immediately
+  - `GET /api/extract/jobs/:id` — poll for status + results
+  - Webhook callback when complete
+- [ ] **6.4** Add Vertex AI batch processing support (for high-volume MindLyst triage)
+
+### Custom model support
+
+- [ ] **6.5** Add Ollama provider for local/air-gapped deployments
+- [ ] **6.6** Add model benchmarking endpoint: run same task across models, compare quality + cost
+
+### Multi-language extraction
+
+- [ ] **6.7** Test and validate extraction across languages (LangExtract supports multi-language via LLM)
+- [ ] **6.8** Add language detection to extraction pipeline (auto-detect input language)
+
+---
+
+## Env Vars Summary
+
+| Variable | Service | Default | Description |
+| ------------------------ | ------------------ | ----------------------- | ----------------------------------- |
+| `PORT` | extraction-service | `4005` | Fastify listen port |
+| `HOST` | extraction-service | `0.0.0.0` | Fastify listen host |
+| `CORS_ORIGIN` | extraction-service | `*` | Allowed origins |
+| `PYTHON_SIDECAR_URL` | extraction-service | `http://localhost:4006` | Python sidecar URL |
+| `DEFAULT_MODEL_ID` | extraction-service | `gemini-2.5-flash` | Default LLM model |
+| `GEMINI_API_KEY` | python sidecar | — | Google Gemini API key |
+| `AZURE_OPENAI_API_KEY` | python sidecar | — | Azure OpenAI key (alternative) |
+| `AZURE_OPENAI_ENDPOINT` | python sidecar | — | Azure OpenAI endpoint (alternative) |
+| `MAX_WORKERS` | python sidecar | `10` | Parallel extraction workers |
+| `MAX_CHAR_BUFFER` | python sidecar | `2000` | Chunk size for long docs |
+| `EXTRACTION_CACHE_TTL` | python sidecar | `86400` | Cache TTL in seconds |
+| `COSMOS_ENDPOINT` | extraction-service | — | Azure Cosmos DB endpoint |
+| `COSMOS_KEY` | extraction-service | — | Azure Cosmos DB key |
+| `COSMOS_DATABASE` | extraction-service | `lysnrai` | Database name |
+| `JWT_SECRET` | extraction-service | — | JWT validation secret |
+| `EXTRACTION_SERVICE_URL` | consumers | `http://localhost:4005` | Used by dashboards/backends |
+
+---
+
+## Port Allocation
+
+| Service | Port |
+| -------------------------------------------- | -------- |
+| growth-service | 4001 |
+| billing-service | 4002 |
+| platform-service | 4003 |
+| tracker-service | 4004 |
+| **extraction-service** | **4005** |
+| extraction-service python sidecar (internal) | 4006 |
+
+---
+
+## Dependency Graph
+
+```
+@bytelyst/extraction (package)
+  └── @bytelyst/api-client (peer dep)
+
+@lysnrai/extraction-service (service)
+  ├── @bytelyst/fastify-core
+  ├── @bytelyst/auth
+  ├── @bytelyst/config
+  ├── @bytelyst/cosmos
+  ├── @bytelyst/errors
+  ├── fastify, zod, jose (direct deps)
+  └── python sidecar
+        └── langextract, fastapi, uvicorn, structlog
+```
+
+---
+
+## Estimated Effort
+
+| Phase | Effort | Dependencies |
+| ------------------------------ | -------- | ------------ |
+| Phase 0 — Foundation | 2–3 days | None |
+| Phase 1 — Core API | 2–3 days | Phase 0 |
+| Phase 2 — Task Library | 2 days | Phase 1 |
+| Phase 3 — Consumer Integration | 3–4 days | Phase 2 |
+| Phase 4 — Docker & DevOps | 1–2 days | Phase 1 |
+| Phase 5 — Production Hardening | 2–3 days | Phase 3 |
+| Phase 6 — Advanced (future) | Ongoing | Phase 5 |
+
+**Total MVP (Phases 0–4): ~10–14 days**
+
+---
+
+## Rollback Strategy
+
+- The extraction-service is **additive** — no existing code is modified until Phase 3
+- Phase 3 consumer integration uses new endpoints/routes — existing triage/transcript flows remain untouched
+- If extraction-service is down, consumers fall back to their existing behavior (MindLyst mock triage, LysnrAI raw transcripts)
+- The `@bytelyst/extraction` package is optional — dashboards only import it for new extraction features
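
The fallback rule in the Rollback Strategy (consumers keep their existing behavior when extraction-service is down) could look roughly like the sketch below on the LysnrAI backend side. This is a minimal illustration, not the planned `extraction_client.py`: the JSON field names (`taskId`, `text`, `extractions`) and the `transcript_view` helper are assumptions, since the roadmap only fixes the `POST /api/extract` route, the `transcript-extraction` task name, and the `EXTRACTION_SERVICE_URL` default. The auth header that `/api/extract` requires is left out for brevity.

```python
import json
import urllib.request

# Default taken from the Env Vars Summary table (EXTRACTION_SERVICE_URL).
EXTRACTION_SERVICE_URL = "http://localhost:4005"


def extract_transcript(text, base_url=EXTRACTION_SERVICE_URL, timeout=5.0):
    """Return the parsed extraction response, or None if the call fails."""
    # Assumed request body; the roadmap does not pin down the field names.
    payload = json.dumps({"taskId": "transcript-extraction", "text": text}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/extract",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts and HTTP error statuses
        # (urllib's URLError/HTTPError); ValueError covers a non-JSON body.
        return None


def transcript_view(text, **kwargs):
    """Hypothetical helper: raw transcript always, extractions only when available."""
    result = extract_transcript(text, **kwargs)
    extractions = result.get("extractions", []) if isinstance(result, dict) else []
    return {"transcript": text, "extractions": extractions}
```

Returning `None` instead of raising keeps the extraction call optional for the caller, which matches the "additive" framing: the transcript is still served when the service is unreachable.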