learning_ai_common_plat/docs/roadmaps/completed/extraction_SERVICE_ROADMAP.md

Extraction Service — Roadmap & Task Checklist

Service: @lysnrai/extraction-service (port 4005)
Package: @bytelyst/extraction (shared types + client)
Core dependency: google/langextract (Python)

Companion docs: ECOSYSTEM_ARCHITECTURE.md · ROADMAP.md


Overview

A shared extraction microservice that uses Google's LangExtract library to extract structured information from unstructured text. Both LysnrAI and MindLyst consume this service for their respective extraction needs.

Architecture: Fastify (routing, auth, validation, request tracing) + Python sidecar (LangExtract). The Fastify layer keeps the service consistent with the other four services (growth, billing, platform, tracker), while the Python process handles the actual LLM-powered extraction.

┌──────────────────────────────────────────────────────────┐
│                   extraction-service                      │
│                      (port 4005)                          │
│                                                           │
│  ┌─────────────────────┐    ┌──────────────────────────┐ │
│  │   Fastify (TS)      │    │   Python Sidecar         │ │
│  │                     │    │                          │ │
│  │  - Auth middleware   │──►│  - LangExtract wrapper   │ │
│  │  - Zod validation   │◄──│  - Task registry         │ │
│  │  - x-request-id     │    │  - Model provider config │ │
│  │  - Rate limiting    │    │  - Result caching        │ │
│  │  - /health          │    │                          │ │
│  └─────────────────────┘    └──────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
        ▲                              ▲
        │                              │
   REST API                     FastAPI (internal :4006)
   (external)                   or subprocess stdio

Consumers

| Product | Use Case | Entry Point |
| --- | --- | --- |
| LysnrAI — Desktop/Backend | Post-transcription extraction (action items, decisions, dates, people) | backend/src/clients/extraction_client.py |
| LysnrAI — Admin Dashboard | Transcript analytics, entity review | admin-dashboard-web/src/lib/extraction-client.ts |
| MindLyst — KMP/Web | Triage pipeline (brain routing, entity extraction, topic classification) | mindlyst-native/web/src/pages/api/triage.ts |
| MindLyst — Web Dashboard | Brain insight generation, reflection enrichment | Direct API calls via @bytelyst/api-client |

Phase 0 — Foundation & Scaffolding

Goal: Set up the service skeleton, Python environment, and build pipeline.

Service scaffold (Fastify)

  • 0.1 Create services/extraction-service/ directory structure: c292bb5
    services/extraction-service/
      src/
        lib/
          config.ts            # Zod config schema (PORT, HOST, CORS, PYTHON_SIDECAR_URL, etc.)
          errors.ts            # Re-export from @bytelyst/errors
          cosmos.ts            # Re-export from @bytelyst/cosmos (for task registry persistence)
          product-config.ts    # Re-export from @bytelyst/config
          python-bridge.ts     # HTTP client to Python sidecar
        modules/
          extract/
            types.ts           # Zod schemas: ExtractionTask, ExtractionExample, ExtractionResult
            routes.ts          # POST /api/extract, POST /api/extract/batch, GET /api/tasks
          tasks/
            types.ts           # Predefined task definitions (triage, transcript, etc.)
            repository.ts      # Cosmos CRUD for custom task definitions
            routes.ts          # CRUD endpoints for task management
        server.ts              # createServiceApp + route registration
      package.json
      tsconfig.json
      Dockerfile
    
  • 0.2 Create package.json (@lysnrai/extraction-service, port 4005) matching existing service conventions c292bb5
  • 0.3 Create tsconfig.json (self-contained, matching tracker-service pattern) c292bb5
  • 0.4 Create src/lib/config.ts with Zod schema (PORT, HOST, NODE_ENV, CORS_ORIGIN, SERVICE_NAME, PYTHON_SIDECAR_URL, DEFAULT_MODEL_ID, COSMOS_*, JWT_SECRET, DEFAULT_PRODUCT_ID) c292bb5
  • 0.5 Create src/server.ts using createServiceApp() + startService() from @bytelyst/fastify-core c292bb5
  • 0.6 Add .env.example with all required env vars c292bb5
  • 0.7 Verify: pnpm build passes for the new service c292bb5

Python sidecar scaffold

  • 0.8 Create services/extraction-service/python/ directory: c292bb5
    python/
      src/
        __init__.py
        app.py                 # FastAPI app (internal, port 4006)
        extractor.py           # LangExtract wrapper
        task_registry.py       # Built-in task definitions
        models.py              # Pydantic models matching TS Zod schemas
      requirements.txt         # langextract, fastapi, uvicorn, pydantic
      Dockerfile               # Python 3.12 slim
    
  • 0.9 Create python/requirements.txt: c292bb5
    langextract>=0.3.0
    fastapi>=0.115.0
    uvicorn>=0.34.0
    pydantic>=2.10.0
    pydantic-settings>=2.7.0
    structlog>=24.4.0
    
  • 0.10 Create python/src/app.py — FastAPI app with POST /extract, POST /extract/batch, GET /health c292bb5
  • 0.11 Create python/src/extractor.py — wrapper around lx.extract() with mock fallback c292bb5
  • 0.12 Verify: Python sidecar starts and /health returns OK c9d5c0c

Package scaffold (@bytelyst/extraction)

  • 0.13 Create packages/extraction/ directory: c292bb5
    packages/extraction/
      src/
        index.ts               # Public API
        types.ts               # Shared TypeScript types
        client.ts              # createExtractionClient() factory
      package.json
      tsconfig.json
    
  • 0.14 Create package.json (@bytelyst/extraction) with @bytelyst/api-client as peer dep c292bb5
  • 0.15 Define TypeScript types (ExtractionTask, ExtractionExample, ExtractionEntity, ExtractRequest, ExtractResponse, BatchExtractRequest, BatchExtractResponse) c292bb5
  • 0.16 Create createExtractionClient() factory using createApiClient() pattern c292bb5
  • 0.17 Verify: pnpm build passes for the new package c292bb5

Workspace wiring

  • 0.18 Verify extraction-service and extraction covered by packages/* + services/* globs in pnpm-workspace.yaml c292bb5
  • 0.19 Run pnpm install from repo root — workspace resolution verified c292bb5
  • 0.20 Verify: pnpm build passes for both extraction-service and @bytelyst/extraction c292bb5

Phase 1 — Core Extraction API

Goal: Working extraction endpoint that accepts text + task definition and returns structured results via LangExtract.

Python extractor implementation

  • 1.1 Implement extractor.py — LangExtract wrapper with mock fallback, configurable model_id, extraction_passes, max_workers, max_char_buffer c292bb5
  • 1.2 Model provider configuration — Gemini default via DEFAULT_MODEL_ID env var, model_id passthrough to lx.extract() c292bb5
  • 1.3 structlog logging in extractor.py and app.py (extraction_complete, extraction_failed, extract_request) c292bb5
  • 1.4 Request timeout in python-bridge.ts (DEFAULT_TIMEOUT_MS = 120s, configurable per-call) c292bb5

Fastify routes

  • 1.5 Implement src/modules/extract/types.ts — ExtractRequestSchema, ExtractResponseSchema, BatchExtractRequestSchema (Zod) c292bb5
  • 1.6 Implement src/modules/extract/routes.ts — POST /extract, POST /extract/batch, GET /extract/models, GET /extract/sidecar-health c292bb5
  • 1.7 Implement src/lib/python-bridge.ts — sidecarExtract, sidecarExtractBatch, sidecarHealth, waitForSidecar with x-request-id forwarding c292bb5
  • 1.8 Rate limiting on extract routes (30 req/min per IP via @fastify/rate-limit) 0a87d19
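Items 1.4 and 1.7 describe a bridge that forwards x-request-id to the sidecar and enforces a 120 s default timeout. The following is a minimal sketch of that idea, not the actual python-bridge.ts: the names `buildSidecarRequest` and `callSidecar`, the `/extract` path on the sidecar, and the payload shape are assumptions made for illustration.

```typescript
// Hypothetical sketch; names and payload shape are assumptions, not the real python-bridge.ts.
const DEFAULT_TIMEOUT_MS = 120_000; // 120 s default from item 1.4

interface SidecarRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
  timeoutMs: number;
}

// Build the request the bridge would send, forwarding the caller's x-request-id.
function buildSidecarRequest(
  baseUrl: string,
  payload: unknown,
  requestId: string,
  timeoutMs: number = DEFAULT_TIMEOUT_MS,
): SidecarRequest {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/extract`, // tolerate a trailing slash
    method: "POST",
    headers: { "content-type": "application/json", "x-request-id": requestId },
    body: JSON.stringify(payload),
    timeoutMs,
  };
}

// Send the request with fetch, aborting once the timeout elapses.
async function callSidecar(req: SidecarRequest): Promise<unknown> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), req.timeoutMs);
  try {
    const res = await fetch(req.url, {
      method: req.method,
      headers: req.headers,
      body: req.body,
      signal: controller.signal,
    });
    if (!res.ok) throw new Error(`sidecar responded ${res.status}`);
    return await res.json();
  } finally {
    clearTimeout(timer); // always clear so the timer does not outlive the call
  }
}
```

Separating request construction from the network call keeps the header and timeout logic testable without a running sidecar.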

Tests

  • 1.9 Unit tests for Zod schemas — 13 extract tests + 8 task tests (21 total) 0a87d19
  • 1.10 Integration tests for extract routes (mock Python sidecar responses) c9d5c0c
  • 1.11 Python unit tests for extractor.py, models.py, app.py (29 tests) c9d5c0c
  • 1.12 Verify: pnpm test passes (21 tests) 0a87d19

Phase 2 — Predefined Task Library

Goal: Ship a curated set of extraction task definitions that LysnrAI and MindLyst can use out-of-the-box.

Task definitions

  • 2.1 Define transcript-extraction task (6 classes, few-shot examples) c292bb5
  • 2.2 Define triage task (MindLyst) — 6 classes incl. brain_signal with brain/confidence attributes c292bb5
  • 2.3 Define memory-insight task (MindLyst) — 4 classes c292bb5
  • 2.4 Define reflection-enrichment task (MindLyst) — 4 classes c292bb5
  • 2.5 Define bug-report-extraction task (Tracker) — 5 classes c292bb5

Task registry (Cosmos DB)

  • 2.6 Cosmos container extraction_tasks (partition /productId) — created on first access via repository c292bb5
  • 2.7 Implement src/modules/tasks/repository.ts — listTasks, getTask, createTask, updateTask, deleteTask, upsertTask c292bb5
  • 2.8 Implement src/modules/tasks/routes.ts — GET/POST/PUT/DELETE /tasks c292bb5
  • 2.9 Seed built-in tasks on startup via seed.ts (idempotent upsert, 5 tasks) 6a49823
  • 2.10 productId on all task documents (DEFAULT_PRODUCT_ID from env) c292bb5
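Item 2.9 relies on seeding being an idempotent upsert, so restarting the service never duplicates the five built-in tasks. A rough sketch of that property follows. It uses an in-memory store as a stand-in for the Cosmos container, and the document shape, the `InMemoryTaskStore` class, and the use of the task names as ids are all assumptions.

```typescript
// Hypothetical sketch of idempotent seeding; the document shape and store are assumptions.
interface TaskDoc {
  id: string;
  productId: string; // partition key (/productId) per item 2.6
  builtin: boolean;
}

// The five built-in tasks named in items 2.1 to 2.5 (assumed here to double as ids).
const BUILTIN_TASK_IDS = [
  "transcript-extraction",
  "triage",
  "memory-insight",
  "reflection-enrichment",
  "bug-report-extraction",
];

// In-memory stand-in for the Cosmos container, keyed by partition and id.
class InMemoryTaskStore {
  private docs = new Map<string, TaskDoc>();

  upsert(doc: TaskDoc): void {
    this.docs.set(`${doc.productId}/${doc.id}`, doc); // insert or overwrite
  }

  count(): number {
    return this.docs.size;
  }
}

// Upsert every built-in task; running this repeatedly leaves the same five documents.
function seedBuiltinTasks(store: InMemoryTaskStore, productId: string): number {
  for (const id of BUILTIN_TASK_IDS) {
    store.upsert({ id, productId, builtin: true });
  }
  return BUILTIN_TASK_IDS.length;
}
```

Because each write is keyed by productId and id, a second startup overwrites rather than appends.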

Python task registry

  • 2.11 Implement task_registry.py — BUILTIN_TASKS with full definitions inline c292bb5
  • 2.12 Task definitions stored inline in task_registry.py (no separate JSON needed) c292bb5
  • 2.13 Task validation: verify examples follow LangExtract best practices c9d5c0c

Tests

  • 2.14 Tests for task schemas (8 tests in types.test.ts) 0a87d19
  • 2.15 Tests for task seeding (7 tests in seed.test.ts) 6a49823
  • 2.16 Verify: all 28 tests pass 6a49823

Phase 3 — Consumer Integration

Goal: Wire LysnrAI and MindLyst to call the extraction service.

@bytelyst/extraction package finalization

  • 3.1 createExtractionClient() with extract(), extractBatch(), listTasks(), getTask() c292bb5
  • 3.2 Export all types from src/index.ts c292bb5
  • 3.3 pnpm build passes for @bytelyst/extraction c292bb5
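The four client methods in item 3.1 map naturally onto the routes listed in item 0.1. The sketch below shows one way such a factory could look. It takes an injected transport instead of the real createApiClient(), and the `ExtractRequest` fields, the batch body shape, and the `/api/tasks/:id` path are assumptions rather than the package's actual API.

```typescript
// Hypothetical sketch of a client factory; request shapes and the transport are assumptions.
type Transport = (
  method: "GET" | "POST",
  path: string,
  body?: unknown,
) => Promise<unknown>;

interface ExtractRequest {
  text: string;
  taskId: string; // e.g. "triage" or "transcript-extraction"
}

function createExtractionClient(transport: Transport) {
  return {
    // Single-document extraction.
    extract: (req: ExtractRequest) => transport("POST", "/api/extract", req),
    // Several documents in one round trip.
    extractBatch: (reqs: ExtractRequest[]) =>
      transport("POST", "/api/extract/batch", { items: reqs }),
    // Task discovery.
    listTasks: () => transport("GET", "/api/tasks"),
    getTask: (id: string) =>
      transport("GET", `/api/tasks/${encodeURIComponent(id)}`),
  };
}
```

Injecting the transport keeps the client free of any particular HTTP library, so consumers can supply their own auth and base URL handling.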

LysnrAI integration

  • 3.4 Add @bytelyst/extraction to admin-dashboard-web/package.json (via file: ref) 944609a
  • 3.5 Create admin-dashboard-web/src/lib/extraction-client.ts — extractText, extractTranscript, extractBatch, listTasks, getTask, getSidecarHealth 944609a
  • 3.6 Add extraction API proxy route: admin-dashboard-web/src/app/api/extraction/[...path]/route.ts f65e318
  • 3.7 Python extraction client in backend/src/clients/extraction_client.py f65e318
  • 3.8 Post-transcription extraction endpoint POST /api/transcripts/{id}/extract f65e318
  • 3.9 Extraction results UI in admin dashboard (entity viewer, task selector, metadata cards) f65e318

MindLyst integration

  • 3.10 MindLyst web extraction client (standalone, no @bytelyst deps needed) b545244
  • 3.11 Create mindlyst-native/web/src/lib/extraction-client.ts — triageExtract, memoryInsightExtract, reflectionExtract, isExtractionAvailable b545244
  • 3.12 Create API route src/pages/api/extract.ts (triage, memory-insight, reflection-enrichment tasks) da04d4e
  • 3.13 Wire triage flow to use extraction results (best-effort entity enrichment + brain signals) da04d4e
  • 3.14 Wire brain insights to memory-insight task (AI pattern detection) da04d4e
  • 3.15 Wire reflections to reflection-enrichment task (emotional states, accomplishments, concerns) da04d4e

Tests

  • 3.16 Integration tests for LysnrAI extraction (covered by routes.test.ts mocks) c9d5c0c
  • 3.17 Integration tests for MindLyst triage-via-extraction (best-effort, no test breakage) da04d4e
  • 3.18 Verify npx tsc --noEmit across all dashboards — clean pass

Phase 4 — Docker & DevOps

Goal: Containerize, add to docker-compose, update run scripts.

Dockerfile

  • 4.1 Create multi-stage Dockerfile for extraction-service (3-stage: ts-builder, py-builder, runtime) 37343ae
  • 4.2 Create supervisord.conf (manages Fastify :4005 + uvicorn :4006) 37343ae
  • 4.3 Verify: Dockerfile structure validated (full Docker build deferred to CI)

Docker Compose

  • 4.4 Add extraction-service to docker-compose.yml (port 4005, Traefik, Loki, healthcheck) bdd9bb1
  • 4.5 Add to LysnrAI docker-compose.yml (ports 4005+4006, Traefik, Loki, healthcheck) a36b956

Run scripts

  • 4.6 Add extraction-service to run-local-all-services.sh (Fastify + Python sidecar) 87822d5
  • 4.7 Add extraction-service to .windsurf/workflows/start-all-services.md 87822d5
  • 4.8 Add EXTRACTION_SERVICE_URL to LysnrAI .env.example 944609a
  • 4.9 Add extraction service env vars to common platform .env.example bdd9bb1

CI

  • 4.10 Create .github/workflows/ci-extraction-service.yml (TS build+test + Python lint+test) 0d0165e
  • 4.11 CI workflow created (execution deferred — GitHub Actions disabled for billing)

Phase 5 — Production Hardening

Goal: Rate limiting, caching, observability, cost controls.

Caching

  • 5.1 Add result caching in Python sidecar (LRU cache with sha256 keys, configurable TTL + max size) 9c8a316
  • 5.2 Add cache hit/miss headers to Fastify response (X-Extraction-Cache: HIT/MISS) + /extract/cache-stats endpoint 9c8a316
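Item 5.1 combines three ideas: SHA-256 keys, LRU eviction, and a TTL. The real cache lives in the Python sidecar; the sketch below only illustrates the data structure in TypeScript. The class name, the choice of key inputs (text, task id, model id), and the injected clock are assumptions.

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: an LRU cache with TTL, keyed by a SHA-256 digest of the request.
class ExtractionCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly maxSize: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxSize: number, ttlMs: number, now: () => number = Date.now) {
    this.maxSize = maxSize;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Derive a stable key from the inputs assumed to determine the result.
  static key(text: string, taskId: string, modelId: string): string {
    return createHash("sha256")
      .update(JSON.stringify([text, taskId, modelId]))
      .digest("hex");
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired entry
      return undefined;
    }
    // Re-insert so the Map's insertion order reflects recency of use.
    this.entries.delete(key);
    this.entries.set(key, hit);
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxSize) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }
}
```

A `get` that returns a value corresponds to X-Extraction-Cache: HIT, and `undefined` to MISS. With the default EXTRACTION_CACHE_TTL of 86400 seconds, `ttlMs` would be 86_400_000.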

Cost controls

  • 5.3 Add per-user daily extraction quota (free=10, pro=100, enterprise=unlimited) 9c8a316
  • 5.4 Track usage in-memory (Cosmos persistence deferred to Phase 7) 9c8a316
  • 5.5 Return 429 Too Many Requests with X-RateLimit-Limit/Remaining headers 9c8a316
  • 5.6 Add usage reporting endpoint: GET /api/extract/usage (admin) 9c8a316
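The quota rules in items 5.3 to 5.5 can be sketched as a small in-memory counter. The tier limits come from item 5.3; the `DailyQuota` class, the UTC day bucket, and the "unlimited" header value for the enterprise tier are assumptions for illustration.

```typescript
// Hypothetical sketch of an in-memory daily quota; tier limits are from item 5.3.
type Tier = "free" | "pro" | "enterprise";

const DAILY_LIMITS: Record<Tier, number> = {
  free: 10,
  pro: 100,
  enterprise: Infinity,
};

interface QuotaDecision {
  allowed: boolean;
  status: 200 | 429;
  headers: Record<string, string>;
}

class DailyQuota {
  // Usage counters keyed by "<userId>:<YYYY-MM-DD>".
  private usage = new Map<string, number>();

  check(userId: string, tier: Tier, now: Date = new Date()): QuotaDecision {
    const key = `${userId}:${now.toISOString().slice(0, 10)}`; // UTC day bucket
    const limit = DAILY_LIMITS[tier];
    const used = this.usage.get(key) ?? 0;
    const allowed = used < limit;
    if (allowed) this.usage.set(key, used + 1);
    const remaining = Math.max(0, limit - (allowed ? used + 1 : used));
    return {
      allowed,
      status: allowed ? 200 : 429, // 429 Too Many Requests once the quota is spent
      headers: {
        "X-RateLimit-Limit": Number.isFinite(limit) ? String(limit) : "unlimited",
        "X-RateLimit-Remaining": Number.isFinite(remaining)
          ? String(remaining)
          : "unlimited",
      },
    };
  }
}
```

Because usage is held in memory (item 5.4), counters reset whenever the process restarts.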

Observability

  • 5.7 Add structured logging (userId, productId, cacheHit, tokenCount, charCount) b8c0a73
  • 5.8 Add metrics module (counters + histograms) + /extract/metrics endpoint b8c0a73
  • 5.9 Add Grafana dashboard for extraction service (extraction-service.json) b8c0a73

Error handling

  • 5.10 Map sidecar errors to proper HTTP status codes (408, 429, 400, 502, 503) b8c0a73
  • 5.11 Add circuit breaker for Python sidecar (5 failures → 30s OPEN → HALF_OPEN probe) b8c0a73
  • 5.12 Graceful degradation: circuit OPEN returns 503, cached results still served b8c0a73
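The breaker behaviour in item 5.11 (5 failures, 30 s OPEN, then a HALF_OPEN probe) is a small state machine. The following sketch shows one possible shape for it; the class and method names and the injected clock are assumptions, not the service's actual implementation.

```typescript
// Hypothetical sketch of item 5.11: 5 failures -> OPEN for 30 s -> HALF_OPEN probe.
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private current: BreakerState = "CLOSED";
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold = 5, cooldownMs = 30_000, now: () => number = Date.now) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Whether a sidecar call may proceed; moves OPEN -> HALF_OPEN after the cooldown.
  canRequest(): boolean {
    if (this.current === "OPEN" && this.now() - this.openedAt >= this.cooldownMs) {
      this.current = "HALF_OPEN";
    }
    return this.current !== "OPEN";
  }

  // A successful call (including a successful probe) closes the circuit.
  recordSuccess(): void {
    this.failures = 0;
    this.current = "CLOSED";
  }

  // A failed probe re-opens immediately; otherwise open once the threshold is reached.
  recordFailure(): void {
    this.failures += 1;
    if (this.current === "HALF_OPEN" || this.failures >= this.threshold) {
      this.current = "OPEN";
      this.openedAt = this.now();
    }
  }

  state(): BreakerState {
    return this.current;
  }
}
```

In a route handler, a false result from `canRequest()` would correspond to the 503 described in item 5.12, after any cache lookup has been tried.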

Phase 6 — Advanced Features (Future)

Goal: Power-user features, visualization, and batch processing.

Visualization

  • 6.1 Entity visualization components (bar chart, pie chart, timeline) in admin dashboard 00a3617
  • 6.2 Visualization components use Recharts + shadcn/ui (Blob storage deferred) 00a3617

Batch & async processing

  • 6.3 Async extraction job queue: POST /extract/jobs, GET /extract/jobs/:id, GET /extract/jobs 5c1744d
  • 6.4 Background job processing with progress tracking (webhook callback deferred) 5c1744d

Custom model support

  • 6.5 Model registry with tier (standard/premium/free/mock) + GET /extract/models endpoint 5c1744d
  • 6.6 Model registry supports Gemini 2.5 Flash/Pro, 2.0 Flash, and mock extractor 5c1744d

Multi-language extraction

  • 6.7 Multi-language detection (es/fr/de/pt/ja/zh/ko/ar) with CJK unicode + keyword matching 5c1744d
  • 6.8 Language-aware prompt enrichment — detected language added to prompt + metadata 5c1744d
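Item 6.7 pairs Unicode script checks for CJK text with keyword matching for Latin-script languages. A simplified sketch of that approach is shown here. The real detector runs in the Python sidecar; the keyword lists, the two-hit threshold, and the function name `detectLanguage` are illustrative assumptions.

```typescript
// Hypothetical sketch: script ranges for CJK/Arabic, keyword hits for Latin-script languages.
type Lang = "es" | "fr" | "de" | "pt" | "ja" | "zh" | "ko" | "ar";
type LatinLang = "es" | "fr" | "de" | "pt";

// Illustrative keyword lists; a production detector would need far larger ones.
const KEYWORDS: Record<LatinLang, string[]> = {
  es: ["el", "los", "pero", "porque", "está", "muy"],
  fr: ["le", "les", "est", "avec", "très", "nous"],
  de: ["der", "und", "nicht", "ist", "das"],
  pt: ["não", "uma", "você", "também", "obrigado"],
};

function detectLanguage(text: string): Lang | null {
  // Script checks first: kana implies Japanese even when mixed with Han characters.
  if (/[\u3040-\u30ff]/.test(text)) return "ja";
  if (/[\uac00-\ud7af]/.test(text)) return "ko"; // Hangul syllables
  if (/[\u4e00-\u9fff]/.test(text)) return "zh"; // Han without kana
  if (/[\u0600-\u06ff]/.test(text)) return "ar";

  // Keyword matching for Latin-script languages.
  const words = text.toLowerCase().split(/[\s.,;:!?"'()]+/).filter(Boolean);
  let best: Lang | null = null;
  let bestHits = 0;
  for (const lang of Object.keys(KEYWORDS) as LatinLang[]) {
    const hits = words.filter((w) => KEYWORDS[lang].includes(w)).length;
    if (hits > bestHits) {
      best = lang;
      bestHits = hits;
    }
  }
  // Require two hits to limit false positives on short English text.
  return bestHits >= 2 ? best : null;
}
```

A null result means no supported language was detected, in which case the prompt would be left without a language hint.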

Env Vars Summary

| Variable | Service | Default | Description |
| --- | --- | --- | --- |
| PORT | extraction-service | 4005 | Fastify listen port |
| HOST | extraction-service | 0.0.0.0 | Fastify listen host |
| CORS_ORIGIN | extraction-service | * | Allowed origins |
| PYTHON_SIDECAR_URL | extraction-service | http://localhost:4006 | Python sidecar URL |
| DEFAULT_MODEL_ID | extraction-service | gemini-2.5-flash | Default LLM model |
| GEMINI_API_KEY | python sidecar | | Google Gemini API key |
| AZURE_OPENAI_API_KEY | python sidecar | | Azure OpenAI key (alternative) |
| AZURE_OPENAI_ENDPOINT | python sidecar | | Azure OpenAI endpoint (alternative) |
| MAX_WORKERS | python sidecar | 10 | Parallel extraction workers |
| MAX_CHAR_BUFFER | python sidecar | 2000 | Chunk size for long docs |
| EXTRACTION_CACHE_TTL | python sidecar | 86400 | Cache TTL in seconds |
| COSMOS_ENDPOINT | extraction-service | | Azure Cosmos DB endpoint |
| COSMOS_KEY | extraction-service | | Azure Cosmos DB key |
| COSMOS_DATABASE | extraction-service | lysnrai | Database name |
| JWT_SECRET | extraction-service | | JWT validation secret |
| EXTRACTION_SERVICE_URL | consumers | http://localhost:4005 | Used by dashboards/backends |
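To make the defaults above concrete, here is a sketch of a loader that applies them to an environment record. The service itself uses a Zod schema (item 0.4); this hand-rolled version stands in for it only to stay dependency-free. Treating JWT_SECRET as required is an assumption based on it having no listed default.

```typescript
// Hypothetical sketch: a hand-rolled loader standing in for the Zod schema in src/lib/config.ts.
interface ServiceConfig {
  PORT: number;
  HOST: string;
  CORS_ORIGIN: string;
  PYTHON_SIDECAR_URL: string;
  DEFAULT_MODEL_ID: string;
  COSMOS_DATABASE: string;
  JWT_SECRET: string;
}

function loadConfig(env: Record<string, string | undefined>): ServiceConfig {
  const port = Number(env.PORT ?? "4005");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`PORT must be an integer in 1..65535, got "${env.PORT}"`);
  }
  const jwtSecret = env.JWT_SECRET;
  if (!jwtSecret) {
    // No default is listed for JWT_SECRET, so this sketch treats it as required.
    throw new Error("JWT_SECRET is required");
  }
  return {
    PORT: port,
    HOST: env.HOST ?? "0.0.0.0",
    CORS_ORIGIN: env.CORS_ORIGIN ?? "*",
    PYTHON_SIDECAR_URL: env.PYTHON_SIDECAR_URL ?? "http://localhost:4006",
    DEFAULT_MODEL_ID: env.DEFAULT_MODEL_ID ?? "gemini-2.5-flash",
    COSMOS_DATABASE: env.COSMOS_DATABASE ?? "lysnrai",
    JWT_SECRET: jwtSecret,
  };
}
```

Failing fast on a malformed PORT or missing secret surfaces configuration mistakes at startup rather than on the first request.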

Port Allocation

| Service | Port |
| --- | --- |
| growth-service | 4001 |
| billing-service | 4002 |
| platform-service | 4003 |
| tracker-service | 4004 |
| extraction-service | 4005 |
| extraction-service python sidecar (internal) | 4006 |

Dependency Graph

@bytelyst/extraction (package)
  └── @bytelyst/api-client (peer dep)

@lysnrai/extraction-service (service)
  ├── @bytelyst/fastify-core
  ├── @bytelyst/auth
  ├── @bytelyst/config
  ├── @bytelyst/cosmos
  ├── @bytelyst/errors
  ├── fastify, zod, jose (direct deps)
  └── python sidecar
      └── langextract, fastapi, uvicorn, structlog

Estimated Effort

| Phase | Effort | Dependencies |
| --- | --- | --- |
| Phase 0 — Foundation | 2-3 days | None |
| Phase 1 — Core API | 2-3 days | Phase 0 |
| Phase 2 — Task Library | 2 days | Phase 1 |
| Phase 3 — Consumer Integration | 3-4 days | Phase 2 |
| Phase 4 — Docker & DevOps | 1-2 days | Phase 1 |
| Phase 5 — Production Hardening | 2-3 days | Phase 3 |
| Phase 6 — Advanced (future) | Ongoing | Phase 5 |

Total MVP (Phases 0-4): ~10-14 days


Rollback Strategy

  • The extraction-service is additive — no existing code is modified until Phase 3
  • Phase 3 consumer integration uses new endpoints/routes — existing triage/transcript flows remain untouched
  • If extraction-service is down, consumers fall back to their existing behavior (MindLyst mock triage, LysnrAI raw transcripts)
  • The @bytelyst/extraction package is optional — dashboards only import it for new extraction features

Completion Status

All 68+ roadmap items (Phases 0-6) are implemented and checked.

Deferred Items (Now Completed)

| # | Item | What's Done | Status |
| --- | --- | --- | --- |
| 6.4 | Webhook callback for async jobs | POST /extract/jobs with webhookUrl + HMAC-SHA256 signing + retry + delivery log | Built |

New Production Hardening Features (Completed)

| Feature | Description | Files | Tests |
| --- | --- | --- | --- |
| Webhook Callbacks | HMAC-signed webhook delivery on job completion with retry | webhooks.ts | 15 tests |
| Per-Product Rate Limiting | 100 req/min per productId with reset endpoint | product-rate-limit.ts | 14 tests |
| Sidecar Health Monitoring | Proactive health checks with alerting hooks | sidecar-monitor.ts | 17 tests |
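HMAC-signed delivery means the receiver can confirm that a callback came from the extraction service and was not altered in transit. The sketch below shows the signing and verification halves with Node's built-in crypto module. It is not the code in webhooks.ts: the function names and the hex encoding of the signature are assumptions.

```typescript
import { Buffer } from "node:buffer";
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical sketch of HMAC-SHA256 webhook signing; names and hex encoding are assumptions.

// Sender side: sign the exact bytes of the payload with the shared webhook secret.
function signWebhook(payload: string, secret: string): string {
  return createHmac("sha256", secret).update(payload).digest("hex");
}

// Receiver side: recompute the signature and compare in constant time.
function verifyWebhook(payload: string, secret: string, signature: string): boolean {
  const expected = Buffer.from(signWebhook(payload, secret), "hex");
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The receiver must verify against the raw request body, since re-serializing parsed JSON can change the bytes and invalidate the signature.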

Verification Summary

| Check | Status |
| --- | --- |
| pnpm --filter @lysnrai/extraction-service build | Clean |
| pnpm --filter @lysnrai/extraction-service test | 146 tests passing |
| pnpm --filter @bytelyst/extraction build | Clean |
| npx tsc --noEmit (admin-dashboard-web) | Clean |
| npx tsc --noEmit (mindlyst-native/web) | Clean |
| Python sidecar tests (29 tests) | Passing |

Test Breakdown:

  • Phase 1 (Core API): 46 tests
  • Phase 2 (Tasks): 28 tests
  • Phase 5 (Hardening): 72 tests (includes new features)

New API Endpoints:

  • POST /extract/jobs - Now accepts webhookUrl, webhookSecret, webhookRetryAttempts
  • GET /extract/monitoring/sidecar - Health monitoring status
  • POST /extract/monitoring/sidecar/check - Trigger immediate health check
  • GET /extract/rate-limits/product - Product rate limit status
  • POST /extract/rate-limits/product/reset - Reset product rate limit (admin)
  • GET /extract/webhooks/delivery-stats - Webhook delivery statistics