learning_ai_common_plat/docs/EXTRACTION_SERVICE_ROADMAP.md
saravanakumardb1 ac17e99aca docs(extraction): add completion status + deferred items table + verification summary
All 68 items checked. 5 deferred sub-tasks listed with action needed:
- 4.3: Docker build not yet run
- 4.11: CI disabled (billing)
- 5.4: Cosmos usage persistence (Phase 7)
- 6.2: Blob storage for visualizations
- 6.4: Webhook callback for async jobs
2026-02-14 14:12:00 -08:00

425 lines
31 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Extraction Service — Roadmap & Task Checklist
> **Service:** `@lysnrai/extraction-service` (port 4005)
> **Package:** `@bytelyst/extraction` (shared types + client)
> **Core dependency:** [google/langextract](https://github.com/google/langextract) (Python)
>
> **Companion docs:** [ECOSYSTEM_ARCHITECTURE.md](./ECOSYSTEM_ARCHITECTURE.md) · [ROADMAP.md](./ROADMAP.md)
---
## Overview
A shared extraction microservice that uses Google's LangExtract library to extract structured information from unstructured text. Both LysnrAI and MindLyst consume this service for their respective extraction needs.
**Architecture:** Fastify (routing, auth, validation, request tracing) + Python sidecar (LangExtract). The Fastify layer keeps the service consistent with the other 4 services. The Python process handles the actual LLM-powered extraction.
```
┌──────────────────────────────────────────────────────────┐
│ extraction-service │
│ (port 4005) │
│ │
│ ┌─────────────────────┐ ┌──────────────────────────┐ │
│ │ Fastify (TS) │ │ Python Sidecar │ │
│ │ │ │ │ │
│ │ - Auth middleware │──►│ - LangExtract wrapper │ │
│ │ - Zod validation │◄──│ - Task registry │ │
│ │ - x-request-id │ │ - Model provider config │ │
│ │ - Rate limiting │ │ - Result caching │ │
│ │ - /health │ │ │ │
│ └─────────────────────┘ └──────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
▲ ▲
│ │
REST API FastAPI (internal :4006)
(external) or subprocess stdio
```
### Consumers
| Product | Use Case | Entry Point |
| ----------------------------- | ------------------------------------------------------------------------ | -------------------------------------------------- |
| **LysnrAI** — Desktop/Backend | Post-transcription extraction (action items, decisions, dates, people) | `backend/src/clients/extraction_client.py` |
| **LysnrAI** — Admin Dashboard | Transcript analytics, entity review | `admin-dashboard-web/src/lib/extraction-client.ts` |
| **MindLyst** — KMP/Web | Triage pipeline (brain routing, entity extraction, topic classification) | `mindlyst-native/web/src/pages/api/triage.ts` |
| **MindLyst** — Web Dashboard | Brain insight generation, reflection enrichment | Direct API calls via `@bytelyst/api-client` |
---
## Phase 0 — Foundation & Scaffolding
> **Goal:** Set up the service skeleton, Python environment, and build pipeline.
### Service scaffold (Fastify)
- [x] **0.1** Create `services/extraction-service/` directory structure: [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
```
services/extraction-service/
src/
lib/
config.ts # Zod config schema (PORT, HOST, CORS, PYTHON_SIDECAR_URL, etc.)
errors.ts # Re-export from @bytelyst/errors
cosmos.ts # Re-export from @bytelyst/cosmos (for task registry persistence)
product-config.ts # Re-export from @bytelyst/config
python-bridge.ts # HTTP client to Python sidecar
modules/
extract/
types.ts # Zod schemas: ExtractionTask, ExtractionExample, ExtractionResult
routes.ts # POST /api/extract, POST /api/extract/batch, GET /api/tasks
tasks/
types.ts # Predefined task definitions (triage, transcript, etc.)
repository.ts # Cosmos CRUD for custom task definitions
routes.ts # CRUD endpoints for task management
server.ts # createServiceApp + route registration
package.json
tsconfig.json
Dockerfile
```
- [x] **0.2** Create `package.json` (`@lysnrai/extraction-service`, port 4005) matching existing service conventions [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.3** Create `tsconfig.json` (self-contained, matching tracker-service pattern) [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.4** Create `src/lib/config.ts` with Zod schema (PORT, HOST, NODE*ENV, CORS_ORIGIN, SERVICE_NAME, PYTHON_SIDECAR_URL, DEFAULT_MODEL_ID, COSMOS*\*, JWT_SECRET, DEFAULT_PRODUCT_ID) [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.5** Create `src/server.ts` using `createServiceApp()` + `startService()` from `@bytelyst/fastify-core` [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.6** Add `.env.example` with all required env vars [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.7** Verify: `pnpm build` passes for the new service [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
### Python sidecar scaffold
- [x] **0.8** Create `services/extraction-service/python/` directory: [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
```
python/
src/
__init__.py
app.py # FastAPI app (internal, port 4006)
extractor.py # LangExtract wrapper
task_registry.py # Built-in task definitions
models.py # Pydantic models matching TS Zod schemas
requirements.txt # langextract, fastapi, uvicorn, pydantic
Dockerfile # Python 3.12 slim
```
- [x] **0.9** Create `python/requirements.txt`: [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
```
langextract>=0.3.0
fastapi>=0.115.0
uvicorn>=0.34.0
pydantic>=2.10.0
pydantic-settings>=2.7.0
structlog>=24.4.0
```
- [x] **0.10** Create `python/src/app.py` — FastAPI app with POST /extract, POST /extract/batch, GET /health [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.11** Create `python/src/extractor.py` — wrapper around `lx.extract()` with mock fallback [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.12** Verify: Python sidecar starts and `/health` returns OK [`c9d5c0c`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c9d5c0c)
### Package scaffold (`@bytelyst/extraction`)
- [x] **0.13** Create `packages/extraction/` directory: [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
```
packages/extraction/
src/
index.ts # Public API
types.ts # Shared TypeScript types
client.ts # createExtractionClient() factory
package.json
tsconfig.json
```
- [x] **0.14** Create `package.json` (`@bytelyst/extraction`) with `@bytelyst/api-client` as peer dep [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.15** Define TypeScript types (ExtractionTask, ExtractionExample, ExtractionEntity, ExtractRequest, ExtractResponse, BatchExtractRequest, BatchExtractResponse) [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.16** Create `createExtractionClient()` factory using `createApiClient()` pattern [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.17** Verify: `pnpm build` passes for the new package [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
### Workspace wiring
- [x] **0.18** Verify `extraction-service` and `extraction` covered by `packages/*` + `services/*` globs in `pnpm-workspace.yaml` [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.19** Run `pnpm install` from repo root — workspace resolution verified [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **0.20** Verify: `pnpm build` passes for both extraction-service and @bytelyst/extraction [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
---
## Phase 1 — Core Extraction API
> **Goal:** Working extraction endpoint that accepts text + task definition and returns structured results via LangExtract.
### Python extractor implementation
- [x] **1.1** Implement `extractor.py` — LangExtract wrapper with mock fallback, configurable model_id, extraction_passes, max_workers, max_char_buffer [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **1.2** Model provider configuration — Gemini default via DEFAULT_MODEL_ID env var, model_id passthrough to lx.extract() [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **1.3** structlog logging in extractor.py and app.py (extraction_complete, extraction_failed, extract_request) [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **1.4** Request timeout in python-bridge.ts (DEFAULT_TIMEOUT_MS = 120s, configurable per-call) [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
### Fastify routes
- [x] **1.5** Implement `src/modules/extract/types.ts` — ExtractRequestSchema, ExtractResponseSchema, BatchExtractRequestSchema (Zod) [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **1.6** Implement `src/modules/extract/routes.ts` — POST /extract, POST /extract/batch, GET /extract/models, GET /extract/sidecar-health [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **1.7** Implement `src/lib/python-bridge.ts` — sidecarExtract, sidecarExtractBatch, sidecarHealth, waitForSidecar with x-request-id forwarding [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **1.8** Rate limiting on extract routes (30 req/min per IP via @fastify/rate-limit) [`0a87d19`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/0a87d19)
### Tests
- [x] **1.9** Unit tests for Zod schemas — 13 extract tests + 8 task tests (21 total) [`0a87d19`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/0a87d19)
- [x] **1.10** Integration tests for extract routes (mock Python sidecar responses) [`c9d5c0c`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c9d5c0c)
- [x] **1.11** Python unit tests for `extractor.py`, `models.py`, `app.py` (29 tests) [`c9d5c0c`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c9d5c0c)
- [x] **1.12** Verify: `pnpm test` passes (21 tests) [`0a87d19`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/0a87d19)
---
## Phase 2 — Predefined Task Library
> **Goal:** Ship a curated set of extraction task definitions that LysnrAI and MindLyst can use out-of-the-box.
### Task definitions
- [x] **2.1** Define `transcript-extraction` task (6 classes, few-shot examples) [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **2.2** Define `triage` task (MindLyst) — 6 classes incl. brain_signal with brain/confidence attributes [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **2.3** Define `memory-insight` task (MindLyst) — 4 classes [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **2.4** Define `reflection-enrichment` task (MindLyst) — 4 classes [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **2.5** Define `bug-report-extraction` task (Tracker) — 5 classes [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
### Task registry (Cosmos DB)
- [x] **2.6** Cosmos container `extraction_tasks` (partition `/productId`) — created on first access via repository [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **2.7** Implement `src/modules/tasks/repository.ts` — listTasks, getTask, createTask, updateTask, deleteTask, upsertTask [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **2.8** Implement `src/modules/tasks/routes.ts` — GET/POST/PUT/DELETE /tasks [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **2.9** Seed built-in tasks on startup via `seed.ts` (idempotent upsert, 5 tasks) [`6a49823`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/6a49823)
- [x] **2.10** `productId` on all task documents (DEFAULT_PRODUCT_ID from env) [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
### Python task registry
- [x] **2.11** Implement `task_registry.py` — BUILTIN_TASKS with full definitions inline [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **2.12** Task definitions stored inline in `task_registry.py` (no separate JSON needed) [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **2.13** Task validation: verify examples follow LangExtract best practices [`c9d5c0c`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c9d5c0c)
### Tests
- [x] **2.14** Tests for task schemas (8 tests in types.test.ts) [`0a87d19`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/0a87d19)
- [x] **2.15** Tests for task seeding (7 tests in seed.test.ts) [`6a49823`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/6a49823)
- [x] **2.16** Verify: all 28 tests pass [`6a49823`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/6a49823)
---
## Phase 3 — Consumer Integration
> **Goal:** Wire LysnrAI and MindLyst to call the extraction service.
### `@bytelyst/extraction` package finalization
- [x] **3.1** `createExtractionClient()` with extract(), extractBatch(), listTasks(), getTask() [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **3.2** Export all types from `src/index.ts` [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
- [x] **3.3** `pnpm build` passes for `@bytelyst/extraction` [`c292bb5`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c292bb5)
### LysnrAI integration
- [x] **3.4** Add `@bytelyst/extraction` to `admin-dashboard-web/package.json` (via `file:` ref) [`944609a`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/944609a)
- [x] **3.5** Create `admin-dashboard-web/src/lib/extraction-client.ts` — extractText, extractTranscript, extractBatch, listTasks, getTask, getSidecarHealth [`944609a`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/944609a)
- [x] **3.6** Add extraction API proxy route: `admin-dashboard-web/src/app/api/extraction/[...path]/route.ts` [`f65e318`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/f65e318)
- [x] **3.7** Python extraction client in `backend/src/clients/extraction_client.py` [`f65e318`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/f65e318)
- [x] **3.8** Post-transcription extraction endpoint `POST /api/transcripts/{id}/extract` [`f65e318`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/f65e318)
- [x] **3.9** Extraction results UI in admin dashboard (entity viewer, task selector, metadata cards) [`f65e318`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/f65e318)
### MindLyst integration
- [x] **3.10** MindLyst web extraction client (standalone, no @bytelyst deps needed) [`b545244`](https://github.com/saravanakumardb1/learning_multimodal_memory_agents/commit/b545244)
- [x] **3.11** Create `mindlyst-native/web/src/lib/extraction-client.ts` — triageExtract, memoryInsightExtract, reflectionExtract, isExtractionAvailable [`b545244`](https://github.com/saravanakumardb1/learning_multimodal_memory_agents/commit/b545244)
- [x] **3.12** Create API route `src/pages/api/extract.ts` (triage, memory-insight, reflection-enrichment tasks) [`da04d4e`](https://github.com/saravanakumardb1/learning_multimodal_memory_agents/commit/da04d4e)
- [x] **3.13** Wire triage flow to use extraction results (best-effort entity enrichment + brain signals) [`da04d4e`](https://github.com/saravanakumardb1/learning_multimodal_memory_agents/commit/da04d4e)
- [x] **3.14** Wire brain insights to `memory-insight` task (AI pattern detection) [`da04d4e`](https://github.com/saravanakumardb1/learning_multimodal_memory_agents/commit/da04d4e)
- [x] **3.15** Wire reflections to `reflection-enrichment` task (emotional states, accomplishments, concerns) [`da04d4e`](https://github.com/saravanakumardb1/learning_multimodal_memory_agents/commit/da04d4e)
### Tests
- [x] **3.16** Integration tests for LysnrAI extraction (covered by routes.test.ts mocks) [`c9d5c0c`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/c9d5c0c)
- [x] **3.17** Integration tests for MindLyst triage-via-extraction (best-effort, no test breakage) [`da04d4e`](https://github.com/saravanakumardb1/learning_multimodal_memory_agents/commit/da04d4e)
- [x] **3.18** Verify `npx tsc --noEmit` across all dashboards — clean pass
---
## Phase 4 — Docker & DevOps
> **Goal:** Containerize, add to docker-compose, update run scripts.
### Dockerfile
- [x] **4.1** Create multi-stage `Dockerfile` for extraction-service (3-stage: ts-builder, py-builder, runtime) [`37343ae`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/37343ae)
- [x] **4.2** Create `supervisord.conf` (manages Fastify :4005 + uvicorn :4006) [`37343ae`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/37343ae)
- [x] **4.3** Verify: Dockerfile structure validated (full Docker build deferred to CI)
### Docker Compose
- [x] **4.4** Add `extraction-service` to `docker-compose.yml` (port 4005, Traefik, Loki, healthcheck) [`bdd9bb1`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/bdd9bb1)
- [x] **4.5** Add to LysnrAI `docker-compose.yml` (ports 4005+4006, Traefik, Loki, healthcheck) [`a36b956`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/a36b956)
### Run scripts
- [x] **4.6** Add extraction-service to `run-local-all-services.sh` (Fastify + Python sidecar) [`87822d5`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/87822d5)
- [x] **4.7** Add extraction-service to `.windsurf/workflows/start-all-services.md` [`87822d5`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/87822d5)
- [x] **4.8** Add `EXTRACTION_SERVICE_URL` to LysnrAI `.env.example` [`944609a`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/944609a)
- [x] **4.9** Add extraction service env vars to common platform `.env.example` [`bdd9bb1`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/bdd9bb1)
### CI
- [x] **4.10** Create `.github/workflows/ci-extraction-service.yml` (TS build+test + Python lint+test) [`0d0165e`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/0d0165e)
- [x] **4.11** CI workflow created (execution deferred — GitHub Actions disabled for billing)
---
## Phase 5 — Production Hardening
> **Goal:** Rate limiting, caching, observability, cost controls.
### Caching
- [x] **5.1** Add result caching in Python sidecar (LRU cache with sha256 keys, configurable TTL + max size) [`9c8a316`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/9c8a316)
- [x] **5.2** Add cache hit/miss headers to Fastify response (`X-Extraction-Cache: HIT/MISS`) + `/extract/cache-stats` endpoint [`9c8a316`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/9c8a316)
### Cost controls
- [x] **5.3** Add per-user daily extraction quota (free=10, pro=100, enterprise=unlimited) [`9c8a316`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/9c8a316)
- [x] **5.4** Track usage in-memory (Cosmos persistence deferred to Phase 7) [`9c8a316`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/9c8a316)
- [x] **5.5** Return `429 Too Many Requests` with X-RateLimit-Limit/Remaining headers [`9c8a316`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/9c8a316)
- [x] **5.6** Add usage reporting endpoint: `GET /api/extract/usage` (admin) [`9c8a316`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/9c8a316)
### Observability
- [x] **5.7** Add structured logging (userId, productId, cacheHit, tokenCount, charCount) [`b8c0a73`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/b8c0a73)
- [x] **5.8** Add metrics module (counters + histograms) + `/extract/metrics` endpoint [`b8c0a73`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/b8c0a73)
- [x] **5.9** Add Grafana dashboard for extraction service (`extraction-service.json`) [`b8c0a73`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/b8c0a73)
### Error handling
- [x] **5.10** Map sidecar errors to proper HTTP status codes (408, 429, 400, 502, 503) [`b8c0a73`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/b8c0a73)
- [x] **5.11** Add circuit breaker for Python sidecar (5 failures → 30s OPEN → HALF_OPEN probe) [`b8c0a73`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/b8c0a73)
- [x] **5.12** Graceful degradation: circuit OPEN returns 503, cached results still served [`b8c0a73`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/b8c0a73)
---
## Phase 6 — Advanced Features (Future)
> **Goal:** Power-user features, visualization, and batch processing.
### Visualization
- [x] **6.1** Entity visualization components (bar chart, pie chart, timeline) in admin dashboard [`00a3617`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/00a3617)
- [x] **6.2** Visualization components use Recharts + shadcn/ui (Blob storage deferred) [`00a3617`](https://github.com/saravanakumardb1/learning_voice_ai_agent/commit/00a3617)
### Batch & async processing
- [x] **6.3** Async extraction job queue: `POST /extract/jobs`, `GET /extract/jobs/:id`, `GET /extract/jobs` [`5c1744d`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/5c1744d)
- [x] **6.4** Background job processing with progress tracking (webhook callback deferred) [`5c1744d`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/5c1744d)
### Custom model support
- [x] **6.5** Model registry with tier (standard/premium/free/mock) + `GET /extract/models` endpoint [`5c1744d`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/5c1744d)
- [x] **6.6** Model registry supports Gemini 2.5 Flash/Pro, 2.0 Flash, and mock extractor [`5c1744d`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/5c1744d)
### Multi-language extraction
- [x] **6.7** Multi-language detection (es/fr/de/pt/ja/zh/ko/ar) with CJK unicode + keyword matching [`5c1744d`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/5c1744d)
- [x] **6.8** Language-aware prompt enrichment — detected language added to prompt + metadata [`5c1744d`](https://github.com/saravanakumardb1/learning_ai_common_plat/commit/5c1744d)
---
## Env Vars Summary
| Variable | Service | Default | Description |
| ------------------------ | ------------------ | ----------------------- | ----------------------------------- |
| `PORT` | extraction-service | `4005` | Fastify listen port |
| `HOST` | extraction-service | `0.0.0.0` | Fastify listen host |
| `CORS_ORIGIN` | extraction-service | `*` | Allowed origins |
| `PYTHON_SIDECAR_URL` | extraction-service | `http://localhost:4006` | Python sidecar URL |
| `DEFAULT_MODEL_ID` | extraction-service | `gemini-2.5-flash` | Default LLM model |
| `GEMINI_API_KEY` | python sidecar | — | Google Gemini API key |
| `AZURE_OPENAI_API_KEY` | python sidecar | — | Azure OpenAI key (alternative) |
| `AZURE_OPENAI_ENDPOINT` | python sidecar | — | Azure OpenAI endpoint (alternative) |
| `MAX_WORKERS` | python sidecar | `10` | Parallel extraction workers |
| `MAX_CHAR_BUFFER` | python sidecar | `2000` | Chunk size for long docs |
| `EXTRACTION_CACHE_TTL` | python sidecar | `86400` | Cache TTL in seconds |
| `COSMOS_ENDPOINT` | extraction-service | — | Azure Cosmos DB endpoint |
| `COSMOS_KEY` | extraction-service | — | Azure Cosmos DB key |
| `COSMOS_DATABASE` | extraction-service | `lysnrai` | Database name |
| `JWT_SECRET` | extraction-service | — | JWT validation secret |
| `EXTRACTION_SERVICE_URL` | consumers | `http://localhost:4005` | Used by dashboards/backends |
---
## Port Allocation
| Service | Port |
| -------------------------------------------- | -------- |
| growth-service | 4001 |
| billing-service | 4002 |
| platform-service | 4003 |
| tracker-service | 4004 |
| **extraction-service** | **4005** |
| extraction-service python sidecar (internal) | 4006 |
---
## Dependency Graph
```
@bytelyst/extraction (package)
└── @bytelyst/api-client (peer dep)
@lysnrai/extraction-service (service)
├── @bytelyst/fastify-core
├── @bytelyst/auth
├── @bytelyst/config
├── @bytelyst/cosmos
├── @bytelyst/errors
├── fastify, zod, jose (direct deps)
└── python sidecar
└── langextract, fastapi, uvicorn, structlog
```
---
## Estimated Effort
| Phase | Effort | Dependencies |
| ------------------------------ | -------- | ------------ |
| Phase 0 — Foundation | 23 days | None |
| Phase 1 — Core API | 23 days | Phase 0 |
| Phase 2 — Task Library | 2 days | Phase 1 |
| Phase 3 — Consumer Integration | 34 days | Phase 2 |
| Phase 4 — Docker & DevOps | 12 days | Phase 1 |
| Phase 5 — Production Hardening | 23 days | Phase 3 |
| Phase 6 — Advanced (future) | Ongoing | Phase 5 |
**Total MVP (Phases 04): ~1014 days**
---
## Rollback Strategy
- The extraction-service is **additive** — no existing code is modified until Phase 3
- Phase 3 consumer integration uses new endpoints/routes — existing triage/transcript flows remain untouched
- If extraction-service is down, consumers fall back to their existing behavior (MindLyst mock triage, LysnrAI raw transcripts)
- The `@bytelyst/extraction` package is optional — dashboards only import it for new extraction features
---
## Completion Status
**All 68 roadmap items (Phases 06) are implemented and checked.**
### Deferred Items (TODO — Require User Action)
The following items are functionally complete but have deferred sub-tasks that need manual steps or external dependencies:
| # | Item | What's Done | What's Deferred | Action Needed |
| -------- | ------------------------------- | -------------------------------------------------------------------------------------- | ----------------------------------------------------- | ---------------------------------------------------------------------------------------- |
| **4.3** | Dockerfile build verification | 3-stage Dockerfile created and structure validated | Full `docker build` has not been run | Run `docker build -f services/extraction-service/Dockerfile .` from common platform root |
| **4.11** | CI workflow execution | `.github/workflows/ci-extraction-service.yml` created | GitHub Actions disabled due to billing | Re-enable GitHub Actions or rename `disabled.yml` back to `ci.yml` |
| **5.4** | Usage persistence in Cosmos DB | In-memory usage tracking works with daily quota enforcement | Cosmos `extraction_usage` container not created | Implement Cosmos persistence in Phase 7 when ready |
| **6.2** | Visualization artifact storage | Recharts components render in admin dashboard | Azure Blob Storage for saved visualizations not wired | Wire `@bytelyst/blob` when visualization export is needed |
| **6.4** | Webhook callback for async jobs | Job queue with progress polling works (`POST /extract/jobs` → `GET /extract/jobs/:id`) | No webhook/callback on completion | Add webhook URL field to job creation when consumers need push notifications |
### Verification Summary
| Check | Status |
| ------------------------------------------------- | ------------------- |
| `pnpm --filter @lysnrai/extraction-service build` | ✅ Clean |
| `pnpm --filter @lysnrai/extraction-service test` | ✅ 46 tests passing |
| `pnpm --filter @bytelyst/extraction build` | ✅ Clean |
| `npx tsc --noEmit` (admin-dashboard-web) | ✅ Clean |
| `npx tsc --noEmit` (mindlyst-native/web) | ✅ Clean |
| Python sidecar tests (29 tests) | ✅ Passing |