# Extraction Service — Roadmap & Task Checklist

> **Service:** `@lysnrai/extraction-service` (port 4005)
> **Package:** `@bytelyst/extraction` (shared types + client)
> **Core dependency:** [google/langextract](https://github.com/google/langextract) (Python)
>
> **Companion docs:** [ECOSYSTEM_ARCHITECTURE.md](./ECOSYSTEM_ARCHITECTURE.md) · [ROADMAP.md](./ROADMAP.md)

---

## Overview

A shared extraction microservice that uses Google's LangExtract library to extract structured information from unstructured text. Both LysnrAI and MindLyst consume this service for their respective extraction needs.

**Architecture:** Fastify (routing, auth, validation, request tracing) + Python sidecar (LangExtract). The Fastify layer keeps the service consistent with the other 4 services. The Python process handles the actual LLM-powered extraction.

```
┌──────────────────────────────────────────────────────────┐
│                    extraction-service                    │
│                       (port 4005)                        │
│                                                          │
│  ┌─────────────────────┐   ┌──────────────────────────┐  │
│  │  Fastify (TS)       │   │  Python Sidecar          │  │
│  │                     │   │                          │  │
│  │  - Auth middleware  │──►│  - LangExtract wrapper   │  │
│  │  - Zod validation   │◄──│  - Task registry         │  │
│  │  - x-request-id     │   │  - Model provider config │  │
│  │  - Rate limiting    │   │  - Result caching        │  │
│  │  - /health          │   │                          │  │
│  └─────────────────────┘   └──────────────────────────┘  │
└──────────────────────────────────────────────────────────┘
              ▲                            ▲
              │                            │
           REST API              FastAPI (internal :4006)
          (external)             or subprocess stdio
```

### Consumers

| Product | Use Case | Entry Point |
| ----------------------------- | ------------------------------------------------------------------------ | -------------------------------------------------- |
| **LysnrAI** — Desktop/Backend | Post-transcription extraction (action items, decisions, dates, people) | `backend/src/clients/extraction_client.py` |
| **LysnrAI** — Admin Dashboard | Transcript analytics, entity review | `admin-dashboard-web/src/lib/extraction-client.ts` |
| **MindLyst** — KMP/Web | Triage pipeline (brain routing, entity extraction, topic classification) | `mindlyst-native/web/src/pages/api/triage.ts` |
| **MindLyst** — Web Dashboard | Brain insight generation, reflection enrichment | Direct API calls via `@bytelyst/api-client` |

---

## Phase 0 — Foundation & Scaffolding

> **Goal:** Set up the service skeleton, Python environment, and build pipeline.

### Service scaffold (Fastify)

- [x] **0.1** Create `services/extraction-service/` directory structure: [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
  ```
  services/extraction-service/
    src/
      lib/
        config.ts            # Zod config schema (PORT, HOST, CORS, PYTHON_SIDECAR_URL, etc.)
        errors.ts            # Re-export from @bytelyst/errors
        cosmos.ts            # Re-export from @bytelyst/cosmos (for task registry persistence)
        product-config.ts    # Re-export from @bytelyst/config
        python-bridge.ts     # HTTP client to Python sidecar
      modules/
        extract/
          types.ts           # Zod schemas: ExtractionTask, ExtractionExample, ExtractionResult
          routes.ts          # POST /api/extract, POST /api/extract/batch, GET /api/tasks
        tasks/
          types.ts           # Predefined task definitions (triage, transcript, etc.)
          repository.ts      # Cosmos CRUD for custom task definitions
          routes.ts          # CRUD endpoints for task management
      server.ts              # createServiceApp + route registration
    package.json
    tsconfig.json
    Dockerfile
  ```
- [x] **0.2** Create `package.json` (`@lysnrai/extraction-service`, port 4005) matching existing service conventions [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.3** Create `tsconfig.json` (self-contained, matching tracker-service pattern) [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.4** Create `src/lib/config.ts` with Zod schema (PORT, HOST, NODE_ENV, CORS_ORIGIN, SERVICE_NAME, PYTHON_SIDECAR_URL, DEFAULT_MODEL_ID, COSMOS_\*, JWT_SECRET, DEFAULT_PRODUCT_ID) [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.5** Create `src/server.ts` using `createServiceApp()` + `startService()` from `@bytelyst/fastify-core` [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.6** Add `.env.example` with all required env vars [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.7** Verify: `pnpm build` passes for the new service [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)

### Python sidecar scaffold

- [x] **0.8** Create `services/extraction-service/python/` directory: [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
  ```
  python/
    src/
      __init__.py
      app.py               # FastAPI app (internal, port 4006)
      extractor.py         # LangExtract wrapper
      task_registry.py     # Built-in task definitions
      models.py            # Pydantic models matching TS Zod schemas
    requirements.txt       # langextract, fastapi, uvicorn, pydantic
    Dockerfile             # Python 3.12 slim
  ```
- [x] **0.9** Create `python/requirements.txt`: [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
  ```
  langextract>=0.3.0
  fastapi>=0.115.0
  uvicorn>=0.34.0
  pydantic>=2.10.0
  pydantic-settings>=2.7.0
  structlog>=24.4.0
  ```
- [x] **0.10** Create `python/src/app.py` — FastAPI app with POST /extract, POST /extract/batch, GET /health [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.11** Create `python/src/extractor.py` — wrapper around `lx.extract()` with mock fallback [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [ ] **0.12** Verify: Python sidecar starts and `/health` returns OK (requires `pip install` — deferred to Phase 1)

### Package scaffold (`@bytelyst/extraction`)

- [x] **0.13** Create `packages/extraction/` directory: [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
  ```
  packages/extraction/
    src/
      index.ts       # Public API
      types.ts       # Shared TypeScript types
      client.ts      # createExtractionClient() factory
    package.json
    tsconfig.json
  ```
- [x] **0.14** Create `package.json` (`@bytelyst/extraction`) with `@bytelyst/api-client` as peer dep [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.15** Define TypeScript types (ExtractionTask, ExtractionExample, ExtractionEntity, ExtractRequest, ExtractResponse, BatchExtractRequest, BatchExtractResponse) [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.16** Create `createExtractionClient()` factory using `createApiClient()` pattern [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.17** Verify: `pnpm build` passes for the new package [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)

### Workspace wiring

- [x] **0.18** Verify `extraction-service` and `extraction` covered by `packages/*` + `services/*` globs in `pnpm-workspace.yaml` [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.19** Run `pnpm install` from repo root — workspace resolution verified [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.20** Verify: `pnpm build` passes for both extraction-service and @bytelyst/extraction [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)

---

## Phase 1 — Core Extraction API

> **Goal:** Working extraction endpoint that accepts text + task definition and returns structured results via LangExtract.

### Python extractor implementation

- [ ] **1.1** Implement `extractor.py`:
  - Accept task definition (prompt, examples, model config)
  - Accept input text (string or URL)
  - Call `lx.extract()` with configurable parameters (model_id, extraction_passes, max_workers, max_char_buffer)
  - Return structured results with source grounding (extraction_class, extraction_text, attributes, char offsets)
  - Handle errors gracefully (model timeout, rate limit, invalid input)
- [ ] **1.2** Implement model provider configuration:
  - Gemini (default): API key from env
  - Azure OpenAI: endpoint + key from env
  - Ollama (local dev): configurable base URL
- [ ] **1.3** Add request/response logging via `structlog` (never `print()`)
- [ ] **1.4** Add request timeout configuration (default 120s for long documents)

### Fastify routes

- [ ] **1.5** Implement `src/modules/extract/types.ts`:
  - `ExtractRequestSchema` (Zod) — task definition + input text + options
  - `ExtractResponseSchema` (Zod) — array of extractions + metadata
  - `BatchExtractRequestSchema` — array of inputs + shared task
- [ ] **1.6** Implement `src/modules/extract/routes.ts`:
  - `POST /api/extract` — auth required, validates input, proxies to Python sidecar
  - `POST /api/extract/batch` — auth required, accepts multiple inputs
  - `GET /api/extract/models` — list available model providers
- [ ] **1.7** Implement `src/lib/python-bridge.ts`:
  - HTTP client to Python sidecar (fetch with timeout, retry, error mapping)
  - Health check polling on startup (wait for sidecar readiness)
  - Request ID forwarding (`x-request-id`)
- [ ] **1.8** Add rate limiting to extraction endpoints (configurable per-user limit)

### Tests

- [ ] **1.9** Write unit tests for Zod schemas (`types.test.ts`)
- [ ] **1.10** Write integration tests for extract routes (mock Python sidecar responses)
- [ ] **1.11** Write Python unit tests for `extractor.py` (mock `lx.extract`)
- [ ] **1.12** Verify: `pnpm test` passes, `pytest` passes

---

## Phase 2 — Predefined Task Library

> **Goal:** Ship a curated set of extraction task definitions that LysnrAI and MindLyst can use out-of-the-box.

### Task definitions

- [ ] **2.1** Define `transcript-extraction` task:
  - Classes: `action_item`, `decision`, `question`, `deadline`, `person`, `topic`
  - 3–5 few-shot examples from realistic meeting transcripts
  - Default model: `gemini-2.5-flash`
- [ ] **2.2** Define `triage` task (MindLyst):
  - Classes: `topic`, `entity`, `action`, `emotion`, `date_reference`, `brain_signal`
  - brain_signal attributes: `{ brain: "work|home|money|health|global", confidence: float }`
  - 3–5 few-shot examples per brain type
- [ ] **2.3** Define `memory-insight` task (MindLyst):
  - Classes: `pattern`, `recurring_theme`, `relationship`, `milestone`
  - Examples from accumulated brain memories
- [ ] **2.4** Define `reflection-enrichment` task (MindLyst):
  - Classes: `emotional_state`, `accomplishment`, `concern`, `goal_progress`
  - Examples from journal-style text
- [ ] **2.5** Define `bug-report-extraction` task (Tracker):
  - Classes: `steps_to_reproduce`, `expected_behavior`, `actual_behavior`, `affected_component`, `severity`
  - Examples from real issue submissions

### Task registry (Cosmos DB)

- [ ] **2.6** Create Cosmos container: `extraction_tasks` (partition key: `/productId`)
- [ ] **2.7** Implement `src/modules/tasks/repository.ts` — CRUD for task definitions
- [ ] **2.8** Implement `src/modules/tasks/routes.ts`:
  - `GET /api/tasks` — list all tasks (built-in + custom)
  - `GET /api/tasks/:id` — get task by ID
  - `POST /api/tasks` — create custom task (admin only)
  - `PUT /api/tasks/:id` — update task (admin only)
  - `DELETE /api/tasks/:id` — delete custom task (admin only)
- [ ] **2.9** Seed built-in tasks on service startup (idempotent upsert)
- [ ] **2.10** Add `productId` to all task documents

### Python task registry

- [ ] **2.11** Implement `task_registry.py` — load task definitions from Cosmos (via Fastify API) or local JSON fallback
- [ ] **2.12** Create `python/tasks/` directory with JSON files for each built-in task
- [ ] **2.13** Add task validation: verify examples follow LangExtract best practices (ordered, verbatim, no overlap)

### Tests

- [ ] **2.14** Write tests for task CRUD routes
- [ ] **2.15** Write tests for task seeding logic
- [ ] **2.16** Verify: all tests pass

---

## Phase 3 — Consumer Integration

> **Goal:** Wire LysnrAI and MindLyst to call the extraction service.

### `@bytelyst/extraction` package finalization

- [ ] **3.1** Add typed methods to `createExtractionClient()`:
  - `extract(input, taskId, options?)` — single extraction
  - `extractBatch(inputs, taskId, options?)` — batch extraction
  - `listTasks()` — get available tasks
  - `getTask(id)` — get task details
- [ ] **3.2** Export all types from `src/index.ts`
- [ ] **3.3** Publish: `pnpm build` in `packages/extraction/`

### LysnrAI integration

- [ ] **3.4** Add `@bytelyst/extraction` to `admin-dashboard-web/package.json` (via `file:` ref)
- [ ] **3.5** Create `admin-dashboard-web/src/lib/extraction-client.ts` — typed client instance
- [ ] **3.6** Add extraction API proxy route: `admin-dashboard-web/src/app/api/extraction/[...path]/route.ts`
- [ ] **3.7** Create Python extraction client in `backend/src/clients/extraction_client.py`:
  - HTTP client to extraction-service (port 4005)
  - Methods: `extract_transcript(text)`, `extract_batch(texts)`
- [ ] **3.8** Add post-transcription extraction to LysnrAI backend:
  - New endpoint: `POST /api/transcripts/{id}/extract`
  - Calls extraction-service with `transcript-extraction` task
  - Stores results alongside transcript
- [ ] **3.9** Add extraction results display to admin dashboard (transcript detail page)

### MindLyst integration

- [ ] **3.10** Add `@bytelyst/extraction` to `mindlyst-native/web/package.json` (via `file:` ref):
  ```
  "@bytelyst/extraction": "file:../../../learning_ai_common_plat/packages/extraction"
  ```
- [ ] **3.11** Create `mindlyst-native/web/src/lib/extraction-client.ts`
- [ ] **3.12** Create API route: `mindlyst-native/web/src/pages/api/extract.ts`
  - Accepts raw capture text, calls extraction-service with `triage` task
  - Returns brain routing + extracted entities
- [ ] **3.13** Update triage flow on web dashboard to use extraction results for brain auto-routing
- [ ] **3.14** Wire brain insight generation to use `memory-insight` task
- [ ] **3.15** Wire reflection enrichment to use `reflection-enrichment` task

### Tests

- [ ] **3.16** Add integration tests for LysnrAI extraction endpoint
- [ ] **3.17** Add integration tests for MindLyst triage-via-extraction flow
- [ ] **3.18** Verify: `npx tsc --noEmit` passes in all 3 dashboards + MindLyst web

---

## Phase 4 — Docker & DevOps

> **Goal:** Containerize, add to docker-compose, update run scripts.

### Dockerfile

- [ ] **4.1** Create multi-stage `Dockerfile` for extraction-service:
  - Stage 1: Node.js build (Fastify TS → JS)
  - Stage 2: Python setup (install langextract + deps)
  - Stage 3: Runtime (Node.js + Python, supervisord to run both processes)
- [ ] **4.2** Create `supervisord.conf` to manage Fastify (port 4005) + Python sidecar (port 4006)
- [ ] **4.3** Verify: `docker build` succeeds

### Docker Compose

- [ ] **4.4** Add `extraction-service` to `docker-compose.yml`:
  ```yaml
  extraction-service:
    build:
      context: .
      dockerfile: services/extraction-service/Dockerfile
    ports:
      - '4005:4005'
    env_file:
      - .env
    environment:
      - PORT=4005
      - PYTHON_SIDECAR_URL=http://localhost:4006
    labels:
      - 'traefik.enable=true'
      - 'traefik.http.routers.extraction.rule=PathPrefix(`/api/extract`) || PathPrefix(`/api/tasks`)'
      - 'traefik.http.services.extraction.loadbalancer.server.port=4005'
    logging:
      driver: loki
      options:
        loki-url: 'http://host.docker.internal:3100/loki/api/v1/push'
        loki-retries: '3'
    restart: unless-stopped
    healthcheck:
      test: ['CMD', 'wget', '--no-verbose', '--tries=1', '--spider', 'http://localhost:4005/health']
      interval: 30s
      timeout: 10s
      retries: 3
  ```
- [ ] **4.5** Add to LysnrAI `docker-compose.yml` (references `../learning_ai_common_plat/services/extraction-service/`)

### Run scripts

- [ ] **4.6** Add extraction-service to `run-local-all-services.sh` in LysnrAI repo
- [ ] **4.7** Add extraction-service to `.windsurf/workflows/start-all-services.md`
- [ ] **4.8** Add `.env.example` entries to LysnrAI repo root (`EXTRACTION_SERVICE_URL=http://localhost:4005`)
- [ ] **4.9** Add `.env.example` entries to MindLyst web (same)

### CI

- [ ] **4.10** Create `.github/workflows/ci-extraction-service.yml`:
  - Trigger: push to `services/extraction-service/**` or `packages/extraction/**`
  - Steps: pnpm install, pnpm build, pnpm test (TS), pip install + pytest (Python)
- [ ] **4.11** Verify: CI workflow passes

---

## Phase 5 — Production Hardening

> **Goal:** Rate limiting, caching, observability, cost controls.

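
Item 5.1 in this phase describes the cache key as `hash(task_id + input_text + model_id)`. The sketch below shows one way the sidecar might derive that key. The function name `extraction_cache_key` and the choice of SHA-256 over a JSON-encoded list are assumptions for illustration, not part of the roadmap. Encoding the three fields as a list rather than concatenating them avoids collisions where a character shifts from one field to the next.

```python
import hashlib
import json


def extraction_cache_key(task_id: str, input_text: str, model_id: str) -> str:
    """Return a stable hex digest for one (task, input, model) combination.

    Hypothetical helper: the roadmap only specifies which fields feed the hash.
    """
    # A JSON list keeps field boundaries explicit, so ("ab", "c") and
    # ("a", "bc") never serialize to the same string.
    canonical = json.dumps(
        [task_id, input_text, model_id],
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(extraction_cache_key("triage", "Call Sam on Friday", "gemini-2.5-flash"))
```

The same key works for either storage option named in 5.1, since both an in-memory LRU and Redis accept plain string keys.
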

### Caching

- [ ] **5.1** Add result caching in Python sidecar:
  - Cache key: hash(task_id + input_text + model_id)
  - TTL: configurable (default 24h)
  - Storage: in-memory LRU (dev) or Redis (prod)
- [ ] **5.2** Add cache hit/miss headers to Fastify response (`X-Extraction-Cache: HIT/MISS`)

### Cost controls

- [ ] **5.3** Add per-user daily extraction quota (configurable per plan tier):
  - Free: 10 extractions/day
  - Pro: 100 extractions/day
  - Enterprise: unlimited
- [ ] **5.4** Track usage in Cosmos `extraction_usage` container (partition: `/userId`)
- [ ] **5.5** Return `429 Too Many Requests` with quota info when exceeded
- [ ] **5.6** Add usage reporting endpoint: `GET /api/extract/usage` (admin)

### Observability

- [ ] **5.7** Add structured logging for every extraction:
  - Request: task_id, input_length, model_id, user_id, product_id
  - Response: entity_count, duration_ms, token_count, cache_hit
- [ ] **5.8** Add Prometheus metrics (via `fastify-metrics`):
  - `extraction_requests_total` (labels: task_id, model_id, product_id, status)
  - `extraction_duration_seconds` (histogram)
  - `extraction_entities_extracted` (histogram)
  - `extraction_cache_hit_total`
- [ ] **5.9** Add Grafana dashboard for extraction service (in `services/monitoring/grafana/dashboards/`)

### Error handling

- [ ] **5.10** Map LangExtract errors to `@bytelyst/errors`:
  - Model timeout → `408 Request Timeout`
  - Rate limit (upstream LLM) → `429 Too Many Requests` with retry-after
  - Invalid task definition → `400 Bad Request`
  - Model unavailable → `503 Service Unavailable`
- [ ] **5.11** Add circuit breaker for Python sidecar (fail fast if sidecar is down)
- [ ] **5.12** Add graceful degradation: return partial results if some chunks fail

---

## Phase 6 — Advanced Features (Future)

> **Goal:** Power-user features, visualization, and batch processing.

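
The async flow in item 6.3 of this phase (return a job ID immediately, then poll for status and results) could be prototyped with an in-process job store before any queue or webhook infrastructure exists. This is a rough sketch only: `InMemoryJobStore`, its method names, and the status strings are hypothetical, and a real deployment would need persistent storage so jobs survive restarts.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict


class InMemoryJobStore:
    """Hypothetical job tracker backing submit and poll endpoints."""

    def __init__(self, max_workers: int = 4) -> None:
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, Future] = {}

    def submit(self, run_extraction: Callable[[str], Any], text: str) -> str:
        """Start the extraction in the background and return its job ID at once."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._executor.submit(run_extraction, text)
        return job_id

    def status(self, job_id: str) -> Dict[str, Any]:
        """Report the job state in a shape a polling endpoint could return."""
        future = self._jobs.get(job_id)
        if future is None:
            return {"status": "not_found"}
        if not future.done():
            return {"status": "pending"}
        error = future.exception()
        if error is not None:
            return {"status": "failed", "error": str(error)}
        return {"status": "completed", "result": future.result()}
```

A handler for `POST /api/extract/async` would call `submit()` and return the ID, while `GET /api/extract/jobs/:id` would return the dictionary from `status()`.
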

### Visualization

- [ ] **6.1** Expose LangExtract's HTML visualization:
  - `GET /api/extract/:requestId/visualization` — returns interactive HTML
  - Embed in admin dashboard for extraction quality review
- [ ] **6.2** Store visualization artifacts in Azure Blob Storage (`extractions` container)

### Batch & async processing

- [ ] **6.3** Add async extraction endpoint:
  - `POST /api/extract/async` — returns job ID immediately
  - `GET /api/extract/jobs/:id` — poll for status + results
  - Webhook callback when complete
- [ ] **6.4** Add Vertex AI batch processing support (for high-volume MindLyst triage)

### Custom model support

- [ ] **6.5** Add Ollama provider for local/air-gapped deployments
- [ ] **6.6** Add model benchmarking endpoint: run same task across models, compare quality + cost

### Multi-language extraction

- [ ] **6.7** Test and validate extraction across languages (LangExtract supports multi-language via LLM)
- [ ] **6.8** Add language detection to extraction pipeline (auto-detect input language)

---

## Env Vars Summary

| Variable                 | Service            | Default                 | Description                         |
| ------------------------ | ------------------ | ----------------------- | ----------------------------------- |
| `PORT`                   | extraction-service | `4005`                  | Fastify listen port                 |
| `HOST`                   | extraction-service | `0.0.0.0`               | Fastify listen host                 |
| `CORS_ORIGIN`            | extraction-service | `*`                     | Allowed origins                     |
| `PYTHON_SIDECAR_URL`     | extraction-service | `http://localhost:4006` | Python sidecar URL                  |
| `DEFAULT_MODEL_ID`       | extraction-service | `gemini-2.5-flash`      | Default LLM model                   |
| `GEMINI_API_KEY`         | python sidecar     | —                       | Google Gemini API key               |
| `AZURE_OPENAI_API_KEY`   | python sidecar     | —                       | Azure OpenAI key (alternative)      |
| `AZURE_OPENAI_ENDPOINT`  | python sidecar     | —                       | Azure OpenAI endpoint (alternative) |
| `MAX_WORKERS`            | python sidecar     | `10`                    | Parallel extraction workers         |
| `MAX_CHAR_BUFFER`        | python sidecar     | `2000`                  | Chunk size for long docs            |
| `EXTRACTION_CACHE_TTL`   | python sidecar     | `86400`                 | Cache TTL in seconds                |
| `COSMOS_ENDPOINT`        | extraction-service | —                       | Azure Cosmos DB endpoint            |
| `COSMOS_KEY`             | extraction-service | —                       | Azure Cosmos DB key                 |
| `COSMOS_DATABASE`        | extraction-service | `lysnrai`               | Database name                       |
| `JWT_SECRET`             | extraction-service | —                       | JWT validation secret               |
| `EXTRACTION_SERVICE_URL` | consumers          | `http://localhost:4005` | Used by dashboards/backends         |

---

## Port Allocation

| Service                                      | Port     |
| -------------------------------------------- | -------- |
| growth-service                               | 4001     |
| billing-service                              | 4002     |
| platform-service                             | 4003     |
| tracker-service                              | 4004     |
| **extraction-service**                       | **4005** |
| extraction-service python sidecar (internal) | 4006     |

---

## Dependency Graph

```
@bytelyst/extraction (package)
  └── @bytelyst/api-client (peer dep)

@lysnrai/extraction-service (service)
  ├── @bytelyst/fastify-core
  ├── @bytelyst/auth
  ├── @bytelyst/config
  ├── @bytelyst/cosmos
  ├── @bytelyst/errors
  ├── fastify, zod, jose (direct deps)
  └── python sidecar
        └── langextract, fastapi, uvicorn, structlog
```

---

## Estimated Effort

| Phase                          | Effort   | Dependencies |
| ------------------------------ | -------- | ------------ |
| Phase 0 — Foundation           | 2–3 days | None         |
| Phase 1 — Core API             | 2–3 days | Phase 0      |
| Phase 2 — Task Library         | 2 days   | Phase 1      |
| Phase 3 — Consumer Integration | 3–4 days | Phase 2      |
| Phase 4 — Docker & DevOps      | 1–2 days | Phase 1      |
| Phase 5 — Production Hardening | 2–3 days | Phase 3      |
| Phase 6 — Advanced (future)    | Ongoing  | Phase 5      |

**Total MVP (Phases 0–4): ~10–14 days**

---

## Rollback Strategy

- The extraction-service is **additive** — no existing code is modified until Phase 3
- Phase 3 consumer integration uses new endpoints/routes — existing triage/transcript flows remain untouched
- If extraction-service is down, consumers fall back to their existing behavior (MindLyst mock triage, LysnrAI raw transcripts)
- The `@bytelyst/extraction` package is optional — dashboards only import it for new extraction features
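
The fallback behavior in the rollback strategy (consumers keep working when extraction-service is down) might look like the following on the Python consumer side. This is a minimal sketch under stated assumptions: the function name `extract_with_fallback`, the `{"text", "taskId"}` request body, and the `fallback` callback are all hypothetical, and the auth header the real endpoint requires is omitted for brevity. Only the `/api/extract` path and the default URL come from this document.

```python
import json
import urllib.request
from typing import Any, Callable, Dict

# Default taken from the Env Vars Summary (EXTRACTION_SERVICE_URL).
DEFAULT_BASE_URL = "http://localhost:4005"


def extract_with_fallback(
    text: str,
    task_id: str,
    fallback: Callable[[str], Dict[str, Any]],
    base_url: str = DEFAULT_BASE_URL,
    timeout_s: float = 5.0,
    opener: Callable[..., Any] = urllib.request.urlopen,
) -> Dict[str, Any]:
    """Call the extraction service, or run the existing behavior if it is unreachable."""
    # Hypothetical request body; the real schema is defined by ExtractRequestSchema.
    payload = json.dumps({"text": text, "taskId": task_id}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/extract",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(request, timeout=timeout_s) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers connection failures, timeouts, and HTTP errors raised
        # by urllib; ValueError covers a response body that is not valid JSON.
        return fallback(text)
```

For LysnrAI, `fallback` could simply return the raw transcript with no extractions; for MindLyst, it could call the existing mock triage. The injectable `opener` exists only so the function can be exercised without a running service.
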