learning_ai_common_plat/docs/EXTRACTION_SERVICE_ROADMAP.md
2026-02-14 13:22:25 -08:00


Extraction Service — Roadmap & Task Checklist

Service: @lysnrai/extraction-service (port 4005)
Package: @bytelyst/extraction (shared types + client)
Core dependency: google/langextract (Python)

Companion docs: ECOSYSTEM_ARCHITECTURE.md · ROADMAP.md


Overview

A shared extraction microservice that uses Google's LangExtract library to extract structured information from unstructured text. Both LysnrAI and MindLyst consume this service for their respective extraction needs.

Architecture: Fastify (routing, auth, validation, request tracing) + Python sidecar (LangExtract). The Fastify layer keeps the service consistent with the other four services (growth, billing, platform, tracker). The Python process performs the actual LLM-powered extraction.

┌──────────────────────────────────────────────────────────┐
│                   extraction-service                      │
│                      (port 4005)                          │
│                                                           │
│  ┌─────────────────────┐    ┌──────────────────────────┐ │
│  │   Fastify (TS)      │    │   Python Sidecar         │ │
│  │                     │    │                          │ │
│  │  - Auth middleware  │───►│  - LangExtract wrapper   │ │
│  │  - Zod validation   │◄───│  - Task registry         │ │
│  │  - x-request-id     │    │  - Model provider config │ │
│  │  - Rate limiting    │    │  - Result caching        │ │
│  │  - /health          │    │                          │ │
│  └─────────────────────┘    └──────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
        ▲                              ▲
        │                              │
   REST API                     FastAPI (internal :4006)
   (external)                   or subprocess stdio

Consumers

| Product | Use Case | Entry Point |
| --- | --- | --- |
| LysnrAI — Desktop/Backend | Post-transcription extraction (action items, decisions, dates, people) | backend/src/clients/extraction_client.py |
| LysnrAI — Admin Dashboard | Transcript analytics, entity review | admin-dashboard-web/src/lib/extraction-client.ts |
| MindLyst — KMP/Web | Triage pipeline (brain routing, entity extraction, topic classification) | mindlyst-native/web/src/pages/api/triage.ts |
| MindLyst — Web Dashboard | Brain insight generation, reflection enrichment | Direct API calls via @bytelyst/api-client |

Phase 0 — Foundation & Scaffolding

Goal: Set up the service skeleton, Python environment, and build pipeline.

Service scaffold (Fastify)

  • 0.1 Create services/extraction-service/ directory structure:
    services/extraction-service/
      src/
        lib/
          config.ts            # Zod config schema (PORT, HOST, CORS, PYTHON_SIDECAR_URL, etc.)
          errors.ts            # Re-export from @bytelyst/errors
          cosmos.ts            # Re-export from @bytelyst/cosmos (for task registry persistence)
          product-config.ts    # Re-export from @bytelyst/config
          python-bridge.ts     # HTTP client to Python sidecar
        modules/
          extract/
            types.ts           # Zod schemas: ExtractionTask, ExtractionExample, ExtractionResult
            routes.ts          # POST /api/extract, POST /api/extract/batch, GET /api/extract/models
          tasks/
            types.ts           # Predefined task definitions (triage, transcript, etc.)
            repository.ts      # Cosmos CRUD for custom task definitions
            routes.ts          # CRUD endpoints for task management
        server.ts              # createServiceApp + route registration
      package.json
      tsconfig.json
      Dockerfile
    
  • 0.2 Create package.json (@lysnrai/extraction-service, port 4005) matching existing service conventions
  • 0.3 Create tsconfig.json extending ../../tsconfig.base.json
  • 0.4 Create src/lib/config.ts with Zod schema:
    • PORT (default 4005), HOST, CORS_ORIGIN
    • PYTHON_SIDECAR_URL (default http://localhost:4006)
    • DEFAULT_MODEL_ID (default gemini-2.5-flash)
    • GEMINI_API_KEY or AZURE_OPENAI_API_KEY / AZURE_OPENAI_ENDPOINT
    • MAX_WORKERS (default 10), MAX_CHAR_BUFFER (default 2000)
    • COSMOS_ENDPOINT, COSMOS_KEY, COSMOS_DATABASE, JWT_SECRET
  • 0.5 Create src/server.ts using createServiceApp() + startService() from @bytelyst/fastify-core
  • 0.6 Add .env.example with all required env vars
  • 0.7 Verify: pnpm build passes for the new service

Python sidecar scaffold

  • 0.8 Create services/extraction-service/python/ directory:
    python/
      src/
        __init__.py
        app.py                 # FastAPI app (internal, port 4006)
        extractor.py           # LangExtract wrapper
        task_registry.py       # Built-in task definitions
        models.py              # Pydantic models matching TS Zod schemas
      requirements.txt         # langextract, fastapi, uvicorn, pydantic
      Dockerfile               # Python 3.12 slim
    
  • 0.9 Create python/requirements.txt:
    langextract>=0.3.0
    fastapi>=0.115.0
    uvicorn>=0.34.0
    pydantic>=2.10.0
    pydantic-settings>=2.7.0
    structlog>=24.4.0
    
  • 0.10 Create python/src/app.py — FastAPI app with endpoints:
    • POST /extract — single document extraction
    • POST /extract/batch — batch extraction
    • GET /health — sidecar health check
  • 0.11 Create python/src/extractor.py — wrapper around lx.extract() with configurable model provider
  • 0.12 Verify: Python sidecar starts and /health returns OK

Package scaffold (@bytelyst/extraction)

  • 0.13 Create packages/extraction/ directory:
    packages/extraction/
      src/
        index.ts               # Public API
        types.ts               # Shared TypeScript types
        client.ts              # createExtractionClient() factory
      package.json
      tsconfig.json
    
  • 0.14 Create package.json (@bytelyst/extraction) with @bytelyst/api-client as peer dep
  • 0.15 Define TypeScript types matching the extraction API:
    • ExtractionTask — prompt description + examples + model config
    • ExtractionExample — text + extractions (class, text, attributes)
    • ExtractionResult — extracted entities with source grounding
    • ExtractionRequest — task + input text/URL
    • ExtractionResponse — results + metadata (model, duration, token count)
  • 0.16 Create createExtractionClient() factory using createApiClient() pattern
  • 0.17 Verify: pnpm build passes for the new package

Workspace wiring

  • 0.18 Add extraction-service and extraction to pnpm-workspace.yaml (already covered by packages/* + services/* globs — verify)
  • 0.19 Run pnpm install from repo root — verify workspace resolution
  • 0.20 Verify: pnpm build and pnpm typecheck pass across entire repo

Phase 1 — Core Extraction API

Goal: Working extraction endpoint that accepts text + task definition and returns structured results via LangExtract.

Python extractor implementation

  • 1.1 Implement extractor.py:
    • Accept task definition (prompt, examples, model config)
    • Accept input text (string or URL)
    • Call lx.extract() with configurable parameters (model_id, extraction_passes, max_workers, max_char_buffer)
    • Return structured results with source grounding (extraction_class, extraction_text, attributes, char offsets)
    • Handle errors gracefully (model timeout, rate limit, invalid input)
  • 1.2 Implement model provider configuration:
    • Gemini (default): API key from env
    • Azure OpenAI: endpoint + key from env
    • Ollama (local dev): configurable base URL
  • 1.3 Add request/response logging via structlog (never print())
  • 1.4 Add request timeout configuration (default 120s for long documents)
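A minimal sketch of the wrapper described in 1.1, not the final extractor.py. The keyword arguments passed to lx.extract() (text_or_documents, prompt_description, examples, model_id, extraction_passes, max_workers, max_char_buffer) follow the LangExtract README. The function name run_extraction, the injectable extract_fn parameter, and the returned dict shape are assumptions made for illustration. It also assumes each extraction exposes char_interval.start_pos and char_interval.end_pos, which should be checked against the installed LangExtract version.

```python
from typing import Any, Callable, Optional


def run_extraction(
    text: str,
    prompt: str,
    examples: list,
    model_id: str = "gemini-2.5-flash",
    extraction_passes: int = 1,
    max_workers: int = 10,
    max_char_buffer: int = 2000,
    extract_fn: Optional[Callable[..., Any]] = None,  # hypothetical test seam
) -> dict:
    """Run one extraction and flatten the result into JSON-friendly dicts."""
    if extract_fn is None:
        # Imported lazily so unit tests can run without langextract installed.
        import langextract as lx

        extract_fn = lx.extract

    try:
        result = extract_fn(
            text_or_documents=text,
            prompt_description=prompt,
            examples=examples,
            model_id=model_id,
            extraction_passes=extraction_passes,
            max_workers=max_workers,
            max_char_buffer=max_char_buffer,
        )
    except Exception as exc:  # broad on purpose in this sketch; narrow per provider
        return {
            "ok": False,
            "error": type(exc).__name__,
            "message": str(exc),
            "extractions": [],
        }

    extractions = []
    for item in getattr(result, "extractions", None) or []:
        # Assumed attribute names; char_interval may be None when ungrounded.
        interval = getattr(item, "char_interval", None)
        extractions.append(
            {
                "extraction_class": item.extraction_class,
                "extraction_text": item.extraction_text,
                "attributes": getattr(item, "attributes", None) or {},
                "start": getattr(interval, "start_pos", None),
                "end": getattr(interval, "end_pos", None),
            }
        )
    return {"ok": True, "model_id": model_id, "extractions": extractions}
```

Injecting extract_fn is one way to satisfy task 1.11 (mocking lx.extract) without patching the module.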

Fastify routes

  • 1.5 Implement src/modules/extract/types.ts:
    • ExtractRequestSchema (Zod) — task definition + input text + options
    • ExtractResponseSchema (Zod) — array of extractions + metadata
    • BatchExtractRequestSchema — array of inputs + shared task
  • 1.6 Implement src/modules/extract/routes.ts:
    • POST /api/extract — auth required, validates input, proxies to Python sidecar
    • POST /api/extract/batch — auth required, accepts multiple inputs
    • GET /api/extract/models — list available model providers
  • 1.7 Implement src/lib/python-bridge.ts:
    • HTTP client to Python sidecar (fetch with timeout, retry, error mapping)
    • Health check polling on startup (wait for sidecar readiness)
    • Request ID forwarding (x-request-id)
  • 1.8 Add rate limiting to extraction endpoints (configurable per-user limit)

Tests

  • 1.9 Write unit tests for Zod schemas (types.test.ts)
  • 1.10 Write integration tests for extract routes (mock Python sidecar responses)
  • 1.11 Write Python unit tests for extractor.py (mock lx.extract)
  • 1.12 Verify: pnpm test passes, pytest passes

Phase 2 — Predefined Task Library

Goal: Ship a curated set of extraction task definitions that LysnrAI and MindLyst can use out-of-the-box.

Task definitions

  • 2.1 Define transcript-extraction task:
    • Classes: action_item, decision, question, deadline, person, topic
    • 3–5 few-shot examples from realistic meeting transcripts
    • Default model: gemini-2.5-flash
  • 2.2 Define triage task (MindLyst):
    • Classes: topic, entity, action, emotion, date_reference, brain_signal
    • brain_signal attributes: { brain: "work|home|money|health|global", confidence: float }
    • 3–5 few-shot examples per brain type
  • 2.3 Define memory-insight task (MindLyst):
    • Classes: pattern, recurring_theme, relationship, milestone
    • Examples from accumulated brain memories
  • 2.4 Define reflection-enrichment task (MindLyst):
    • Classes: emotional_state, accomplishment, concern, goal_progress
    • Examples from journal-style text
  • 2.5 Define bug-report-extraction task (Tracker):
    • Classes: steps_to_reproduce, expected_behavior, actual_behavior, affected_component, severity
    • Examples from real issue submissions

Task registry (Cosmos DB)

  • 2.6 Create Cosmos container: extraction_tasks (partition key: /productId)
  • 2.7 Implement src/modules/tasks/repository.ts — CRUD for task definitions
  • 2.8 Implement src/modules/tasks/routes.ts:
    • GET /api/tasks — list all tasks (built-in + custom)
    • GET /api/tasks/:id — get task by ID
    • POST /api/tasks — create custom task (admin only)
    • PUT /api/tasks/:id — update task (admin only)
    • DELETE /api/tasks/:id — delete custom task (admin only)
  • 2.9 Seed built-in tasks on service startup (idempotent upsert)
  • 2.10 Add productId to all task documents

Python task registry

  • 2.11 Implement task_registry.py — load task definitions from Cosmos (via Fastify API) or local JSON fallback
  • 2.12 Create python/tasks/ directory with JSON files for each built-in task
  • 2.13 Add task validation: verify examples follow LangExtract best practices (ordered, verbatim, no overlap)
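The validation in 2.13 could look roughly like the sketch below. It assumes each example in a task's JSON file is a dict with a text field and an extractions list whose items carry extraction_class and extraction_text. That shape, the function name validate_examples, and the sample triage example are illustrative, not the project's actual schema. A single left-to-right cursor checks all three rules at once: a span that cannot be found after the previous span is either not verbatim or out of order/overlapping.

```python
def validate_examples(examples: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the examples pass."""
    problems = []
    for i, example in enumerate(examples):
        text = example["text"]
        cursor = 0  # end offset of the previously matched span
        for j, extraction in enumerate(example["extractions"]):
            span = extraction["extraction_text"]
            where = f"example {i}, extraction {j}"
            if not span:
                problems.append(f"{where}: empty extraction_text")
                continue
            start = text.find(span, cursor)
            if start == -1:
                if span in text:
                    # Present, but only before the cursor.
                    problems.append(f"{where}: out of order or overlapping")
                else:
                    problems.append(f"{where}: not verbatim")
                continue
            cursor = start + len(span)
    return problems


# Hypothetical triage example, for illustration only.
SAMPLE = [
    {
        "text": "Pay the electricity bill by Friday.",
        "extractions": [
            {"extraction_class": "action", "extraction_text": "Pay the electricity bill"},
            {"extraction_class": "date_reference", "extraction_text": "Friday"},
        ],
    }
]

if __name__ == "__main__":
    print(validate_examples(SAMPLE))
```

One limitation of the cursor approach: when the same phrase appears twice in the text, the check accepts the first occurrence after the cursor, which may not be the occurrence the example author meant.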

Tests

  • 2.14 Write tests for task CRUD routes
  • 2.15 Write tests for task seeding logic
  • 2.16 Verify: all tests pass

Phase 3 — Consumer Integration

Goal: Wire LysnrAI and MindLyst to call the extraction service.

@bytelyst/extraction package finalization

  • 3.1 Add typed methods to createExtractionClient():
    • extract(input, taskId, options?) — single extraction
    • extractBatch(inputs, taskId, options?) — batch extraction
    • listTasks() — get available tasks
    • getTask(id) — get task details
  • 3.2 Export all types from src/index.ts
  • 3.3 Build for consumers: run pnpm build in packages/extraction/ (consumers link the package via file: refs)

LysnrAI integration

  • 3.4 Add @bytelyst/extraction to admin-dashboard-web/package.json (via file: ref)
  • 3.5 Create admin-dashboard-web/src/lib/extraction-client.ts — typed client instance
  • 3.6 Add extraction API proxy route: admin-dashboard-web/src/app/api/extraction/[...path]/route.ts
  • 3.7 Create Python extraction client in backend/src/clients/extraction_client.py:
    • HTTP client to extraction-service (port 4005)
    • Methods: extract_transcript(text), extract_batch(texts)
  • 3.8 Add post-transcription extraction to LysnrAI backend:
    • New endpoint: POST /api/transcripts/{id}/extract
    • Calls extraction-service with transcript-extraction task
    • Stores results alongside transcript
  • 3.9 Add extraction results display to admin dashboard (transcript detail page)
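A possible shape for the client in 3.7, using only the standard library. The method names extract_transcript and extract_batch, the port, and the transcript-extraction task ID come from this roadmap. The request body fields (taskId, text, texts), the Bearer token header, and the injectable post parameter are assumptions, since the request schema is not fixed until Phase 1.

```python
import json
import urllib.request
from typing import Callable


def _post_json(url: str, payload: dict, token: str, timeout: float) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


class ExtractionClient:
    """Thin HTTP client for extraction-service (sketch, not the final client)."""

    TASK_ID = "transcript-extraction"

    def __init__(
        self,
        base_url: str = "http://localhost:4005",
        token: str = "",
        timeout: float = 120.0,
        post: Callable[[str, dict, str, float], dict] = _post_json,
    ) -> None:
        self.base_url = base_url.rstrip("/")
        self.token = token
        self.timeout = timeout
        self._post = post  # swappable in tests

    def extract_transcript(self, text: str) -> dict:
        payload = {"taskId": self.TASK_ID, "text": text}  # assumed field names
        return self._post(f"{self.base_url}/api/extract", payload, self.token, self.timeout)

    def extract_batch(self, texts: list) -> dict:
        payload = {"taskId": self.TASK_ID, "texts": texts}  # assumed field names
        return self._post(f"{self.base_url}/api/extract/batch", payload, self.token, self.timeout)
```

In the backend, base_url would presumably be read from EXTRACTION_SERVICE_URL rather than hard-coded.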

MindLyst integration

  • 3.10 Add @bytelyst/extraction to mindlyst-native/web/package.json (via file: ref):
    "@bytelyst/extraction": "file:../../../learning_ai_common_plat/packages/extraction"
    
  • 3.11 Create mindlyst-native/web/src/lib/extraction-client.ts
  • 3.12 Create API route: mindlyst-native/web/src/pages/api/extract.ts
    • Accepts raw capture text, calls extraction-service with triage task
    • Returns brain routing + extracted entities
  • 3.13 Update triage flow on web dashboard to use extraction results for brain auto-routing
  • 3.14 Wire brain insight generation to use memory-insight task
  • 3.15 Wire reflection enrichment to use reflection-enrichment task

Tests

  • 3.16 Add integration tests for LysnrAI extraction endpoint
  • 3.17 Add integration tests for MindLyst triage-via-extraction flow
  • 3.18 Verify: npx tsc --noEmit passes in all 3 dashboards + MindLyst web

Phase 4 — Docker & DevOps

Goal: Containerize, add to docker-compose, update run scripts.

Dockerfile

  • 4.1 Create multi-stage Dockerfile for extraction-service:
    • Stage 1: Node.js build (Fastify TS → JS)
    • Stage 2: Python setup (install langextract + deps)
    • Stage 3: Runtime (Node.js + Python, supervisord to run both processes)
  • 4.2 Create supervisord.conf to manage Fastify (port 4005) + Python sidecar (port 4006)
  • 4.3 Verify: docker build succeeds

Docker Compose

  • 4.4 Add extraction-service to docker-compose.yml:
    extraction-service:
      build:
        context: .
        dockerfile: services/extraction-service/Dockerfile
      ports:
        - '4005:4005'
      env_file:
        - .env
      environment:
        - PORT=4005
        - PYTHON_SIDECAR_URL=http://localhost:4006
      labels:
        - 'traefik.enable=true'
        - 'traefik.http.routers.extraction.rule=PathPrefix(`/api/extract`) || PathPrefix(`/api/tasks`)'
        - 'traefik.http.services.extraction.loadbalancer.server.port=4005'
      logging:
        driver: loki
        options:
          loki-url: 'http://host.docker.internal:3100/loki/api/v1/push'
          loki-retries: '3'
      restart: unless-stopped
      healthcheck:
        test: ['CMD', 'wget', '--no-verbose', '--tries=1', '--spider', 'http://localhost:4005/health']
        interval: 30s
        timeout: 10s
        retries: 3
    
  • 4.5 Add to LysnrAI docker-compose.yml (references ../learning_ai_common_plat/services/extraction-service/)

Run scripts

  • 4.6 Add extraction-service to run-local-all-services.sh in LysnrAI repo
  • 4.7 Add extraction-service to .windsurf/workflows/start-all-services.md
  • 4.8 Add .env.example entries to LysnrAI repo root (EXTRACTION_SERVICE_URL=http://localhost:4005)
  • 4.9 Add .env.example entries to MindLyst web (same)

CI

  • 4.10 Create .github/workflows/ci-extraction-service.yml:
    • Trigger: push to services/extraction-service/** or packages/extraction/**
    • Steps: pnpm install, pnpm build, pnpm test (TS), pip install + pytest (Python)
  • 4.11 Verify: CI workflow passes

Phase 5 — Production Hardening

Goal: Rate limiting, caching, observability, cost controls.

Caching

  • 5.1 Add result caching in Python sidecar:
    • Cache key: hash(task_id + input_text + model_id)
    • TTL: configurable (default 24h)
    • Storage: in-memory LRU (dev) or Redis (prod)
  • 5.2 Add cache hit/miss headers to Fastify response (X-Extraction-Cache: HIT/MISS)
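The in-memory (dev) variant of 5.1 might be sketched as follows. The key is a SHA-256 over task_id, input_text, and model_id, and the cache is an LRU with a per-entry TTL. The names cache_key and TtlLruCache, the max_entries default, and the injectable clock are assumptions; the 24h default matches EXTRACTION_CACHE_TTL. A Redis-backed production variant is out of scope here.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


def cache_key(task_id: str, input_text: str, model_id: str) -> str:
    """Stable hex digest over the three fields that define a cache entry."""
    digest = hashlib.sha256()
    for part in (task_id, input_text, model_id):
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") differ.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


class TtlLruCache:
    """Small LRU cache whose entries also expire after ttl_seconds."""

    def __init__(
        self,
        max_entries: int = 1024,
        ttl_seconds: float = 86400.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```

A get() that returns None would map to X-Extraction-Cache: MISS and any other value to HIT, provided cached results are never None themselves.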

Cost controls

  • 5.3 Add per-user daily extraction quota (configurable per plan tier):
    • Free: 10 extractions/day
    • Pro: 100 extractions/day
    • Enterprise: unlimited
  • 5.4 Track usage in Cosmos extraction_usage container (partition: /userId)
  • 5.5 Return 429 Too Many Requests with quota info when exceeded
  • 5.6 Add usage reporting endpoint: GET /api/extract/usage (admin)
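The quota decision in 5.3 to 5.5 reduces to a small pure function. The sketch below uses the tier limits listed above. The function name check_quota, the lowercase tier keys, and the returned field names are assumptions. Reading the daily count from the extraction_usage container is left out, and the logic is shown in Python only for consistency with the other examples, since the enforcing layer is not specified here.

```python
from typing import Optional

# Daily limits per plan tier, as listed in 5.3 (None means unlimited).
DAILY_LIMITS = {"free": 10, "pro": 100, "enterprise": None}


def check_quota(tier: str, used_today: int) -> dict:
    """Decide whether one more extraction is allowed for this user today."""
    limit: Optional[int] = DAILY_LIMITS[tier]  # unknown tiers raise KeyError
    if limit is None:
        return {"allowed": True, "status": 200, "limit": None, "remaining": None}
    allowed = used_today < limit
    return {
        "allowed": allowed,
        "status": 200 if allowed else 429,  # 429 Too Many Requests, per 5.5
        "limit": limit,
        "remaining": max(limit - used_today, 0),  # before counting this request
    }
```

For example, check_quota("free", 10) yields status 429 with remaining 0, which is the quota info 5.5 asks the response to carry.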

Observability

  • 5.7 Add structured logging for every extraction:
    • Request: task_id, input_length, model_id, user_id, product_id
    • Response: entity_count, duration_ms, token_count, cache_hit
  • 5.8 Add Prometheus metrics (via fastify-metrics):
    • extraction_requests_total (labels: task_id, model_id, product_id, status)
    • extraction_duration_seconds (histogram)
    • extraction_entities_extracted (histogram)
    • extraction_cache_hit_total
  • 5.9 Add Grafana dashboard for extraction service (in services/monitoring/grafana/dashboards/)

Error handling

  • 5.10 Map LangExtract errors to @bytelyst/errors:
    • Model timeout → 408 Request Timeout
    • Rate limit (upstream LLM) → 429 Too Many Requests with retry-after
    • Invalid task definition → 400 Bad Request
    • Model unavailable → 503 Service Unavailable
  • 5.11 Add circuit breaker for Python sidecar (fail fast if sidecar is down)
  • 5.12 Add graceful degradation: return partial results if some chunks fail
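The circuit breaker in 5.11 is a small state machine. The sketch below shows it in Python to match the other examples, although in this architecture it would most likely live in the Fastify bridge. The class name, thresholds, and injectable clock are assumptions. It is simplified: once the cool-down has elapsed, every caller is let through until the next failure, whereas a production breaker would usually admit a single trial request.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Fail fast after repeated sidecar failures, then retry after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Open: only allow a trial request once the cool-down has elapsed.
        return self._clock() - self._opened_at >= self._reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # close the circuit

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            # Open the circuit, or restart the cool-down after a failed trial.
            self._opened_at = self._clock()
```

When allow_request() returns False, the caller can return 503 Service Unavailable immediately instead of waiting for the sidecar timeout.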

Phase 6 — Advanced Features (Future)

Goal: Power-user features, visualization, and batch processing.

Visualization

  • 6.1 Expose LangExtract's HTML visualization:
    • GET /api/extract/:requestId/visualization — returns interactive HTML
    • Embed in admin dashboard for extraction quality review
  • 6.2 Store visualization artifacts in Azure Blob Storage (extractions container)

Batch & async processing

  • 6.3 Add async extraction endpoint:
    • POST /api/extract/async — returns job ID immediately
    • GET /api/extract/jobs/:id — poll for status + results
    • Webhook callback when complete
  • 6.4 Add Vertex AI batch processing support (for high-volume MindLyst triage)

Custom model support

  • 6.5 Add Ollama provider for local/air-gapped deployments
  • 6.6 Add model benchmarking endpoint: run same task across models, compare quality + cost

Multi-language extraction

  • 6.7 Test and validate extraction across languages (LangExtract supports multi-language via LLM)
  • 6.8 Add language detection to extraction pipeline (auto-detect input language)

Env Vars Summary

| Variable | Service | Default | Description |
| --- | --- | --- | --- |
| PORT | extraction-service | 4005 | Fastify listen port |
| HOST | extraction-service | 0.0.0.0 | Fastify listen host |
| CORS_ORIGIN | extraction-service | * | Allowed origins |
| PYTHON_SIDECAR_URL | extraction-service | http://localhost:4006 | Python sidecar URL |
| DEFAULT_MODEL_ID | extraction-service | gemini-2.5-flash | Default LLM model |
| GEMINI_API_KEY | python sidecar | (none) | Google Gemini API key |
| AZURE_OPENAI_API_KEY | python sidecar | (none) | Azure OpenAI key (alternative) |
| AZURE_OPENAI_ENDPOINT | python sidecar | (none) | Azure OpenAI endpoint (alternative) |
| MAX_WORKERS | python sidecar | 10 | Parallel extraction workers |
| MAX_CHAR_BUFFER | python sidecar | 2000 | Chunk size for long docs |
| EXTRACTION_CACHE_TTL | python sidecar | 86400 | Cache TTL in seconds |
| COSMOS_ENDPOINT | extraction-service | (none) | Azure Cosmos DB endpoint |
| COSMOS_KEY | extraction-service | (none) | Azure Cosmos DB key |
| COSMOS_DATABASE | extraction-service | lysnrai | Database name |
| JWT_SECRET | extraction-service | (none) | JWT validation secret |
| EXTRACTION_SERVICE_URL | consumers | http://localhost:4005 | Used by dashboards/backends |
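As an illustration of how the sidecar might read its variables, here is a standard-library sketch with the defaults listed above. The requirements list includes pydantic-settings, so the real implementation would likely use that instead. The names SidecarSettings, load_settings, and configured_provider are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SidecarSettings:
    max_workers: int
    max_char_buffer: int
    cache_ttl_seconds: int
    gemini_api_key: Optional[str]
    azure_openai_api_key: Optional[str]
    azure_openai_endpoint: Optional[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> SidecarSettings:
    """Read sidecar settings from the environment, applying documented defaults."""
    env = os.environ if env is None else env
    return SidecarSettings(
        max_workers=int(env.get("MAX_WORKERS", "10")),
        max_char_buffer=int(env.get("MAX_CHAR_BUFFER", "2000")),
        cache_ttl_seconds=int(env.get("EXTRACTION_CACHE_TTL", "86400")),
        # Empty strings are treated the same as unset variables.
        gemini_api_key=env.get("GEMINI_API_KEY") or None,
        azure_openai_api_key=env.get("AZURE_OPENAI_API_KEY") or None,
        azure_openai_endpoint=env.get("AZURE_OPENAI_ENDPOINT") or None,
    )


def configured_provider(settings: SidecarSettings) -> Optional[str]:
    """Pick a hosted provider from the credentials present; Gemini is the default."""
    if settings.gemini_api_key:
        return "gemini"
    if settings.azure_openai_api_key and settings.azure_openai_endpoint:
        return "azure_openai"
    return None  # e.g. local Ollama, which needs no API key
```

Passing a plain dict as env keeps the loader easy to unit test without touching the process environment.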

Port Allocation

| Service | Port |
| --- | --- |
| growth-service | 4001 |
| billing-service | 4002 |
| platform-service | 4003 |
| tracker-service | 4004 |
| extraction-service | 4005 |
| extraction-service python sidecar (internal) | 4006 |

Dependency Graph

@bytelyst/extraction (package)
  └── @bytelyst/api-client (peer dep)

@lysnrai/extraction-service (service)
  ├── @bytelyst/fastify-core
  ├── @bytelyst/auth
  ├── @bytelyst/config
  ├── @bytelyst/cosmos
  ├── @bytelyst/errors
  ├── fastify, zod, jose (direct deps)
  └── python sidecar
      └── langextract, fastapi, uvicorn, structlog

Estimated Effort

| Phase | Effort | Dependencies |
| --- | --- | --- |
| Phase 0 — Foundation | 2–3 days | None |
| Phase 1 — Core API | 2–3 days | Phase 0 |
| Phase 2 — Task Library | 2 days | Phase 1 |
| Phase 3 — Consumer Integration | 3–4 days | Phase 2 |
| Phase 4 — Docker & DevOps | 1–2 days | Phase 1 |
| Phase 5 — Production Hardening | 2–3 days | Phase 3 |
| Phase 6 — Advanced (future) | Ongoing | Phase 5 |

Total MVP (Phases 0–4): ~10–14 days


Rollback Strategy

  • The extraction-service is additive — no existing code is modified until Phase 3
  • Phase 3 consumer integration uses new endpoints/routes — existing triage/transcript flows remain untouched
  • If extraction-service is down, consumers fall back to their existing behavior (MindLyst mock triage, LysnrAI raw transcripts)
  • The @bytelyst/extraction package is optional — dashboards only import it for new extraction features
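The fallback rule for LysnrAI (serve the raw transcript when extraction-service is unavailable) could be expressed as a wrapper like the one below. The function name, the degraded flag, and the response shape are assumptions for illustration. The extract argument stands for any callable that sends the transcript to the service and returns its JSON response.

```python
from typing import Callable


def transcript_with_extractions(transcript: str, extract: Callable[[str], dict]) -> dict:
    """Attach extraction results when available; otherwise return the raw transcript."""
    try:
        result = extract(transcript)
    except Exception:
        # Broad on purpose: any failure (timeout, connection refused, 5xx)
        # falls back to the existing raw-transcript behavior.
        return {"transcript": transcript, "extractions": None, "degraded": True}
    return {
        "transcript": transcript,
        "extractions": result.get("extractions", []),
        "degraded": False,
    }
```

Keeping the fallback in one wrapper means the existing transcript flow never depends on the new service being reachable.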