learning_ai_common_plat/docs/WINDSURF/PLATFORM_COMPONENTS_ROADMAP.md
2026-02-17 10:49:14 -08:00


Platform Components Roadmap — What's Built, What's Missing, What's Next

Status: Living document — brainstorm + gap analysis
Last updated: 2026-02-17
Scope: All infrastructure components relevant to admin, DevOps, and product operations across the ByteLyst platform.
Repos: learning_ai_common_plat (platform-service, packages) · learning_voice_ai_agent (dashboards, clients)


Table of Contents

  1. Current Inventory
  2. Gap Analysis — Missing Components
  3. Implementation Priority Matrix
  4. New Cosmos Containers & Cost Impact
  5. New Environment Variables
  6. Quick Reference — Where Things Live

1. Current Inventory

1.1 Platform-Service Modules (25 modules)

| Category | Module | Endpoints | Description |
| --- | --- | --- | --- |
| Identity | auth | 11 routes | Login, register, refresh, SSO, profile, admin user CRUD |
| Identity | tokens | 5 routes | API token management (CRUD + validate) |
| Identity | licenses | 6 routes | License key generation, activation, device binding, validate |
| Billing | subscriptions | 5 routes | Plan management, trial tracking, period management |
| Billing | stripe | 2 routes | Inbound Stripe webhook + portal session |
| Billing | plans | 4 routes | Plan definitions (free, pro, enterprise) |
| Billing | usage | 4 routes | Usage tracking and quota enforcement |
| Billing | promos | 5 routes | Promo code creation, validation, redemption |
| Growth | invitations | 5 routes | Invitation code generation, redemption, tracking |
| Growth | referrals | 5 routes | Referral link tracking, status transitions |
| Growth | waitlist | 12 routes | Pre-launch signups, position tracking, admin batch invite, CSV export |
| Growth | public | 5 routes | Public roadmap, community voting, feature submissions |
| Content | items | 5 routes | Tracker items (bugs, features, tasks) |
| Content | comments | 4 routes | Threaded comments on items |
| Content | votes | 3 routes | User votes on items and comments |
| Content | memory | 5 routes | Memory items: create, reassign, patch, delete |
| Ops | audit | Query | Audit log recording and admin queries |
| Ops | flags | 5 routes | Feature flags with FNV-1a deterministic rollout |
| Ops | telemetry | 9 routes | Client event ingestion, error clustering, collection policies, GDPR erasure |
| Ops | notifications | 5 routes | Device registration, notification preferences |
| Ops | settings | 6 routes | User/device settings, kill switch |
| Ops | ratelimit | 4 routes | Rate limit checking, config management |
| Ops | themes | 7 routes | Platform theming (iOS, Android, Desktop) |
| Ops | blob | 5 routes | Azure Blob Storage SAS tokens, list, delete, info |
| Registry | products | 4 routes | Multi-product registry with full lifecycle (draft → pre_launch → beta → active → sunset → disabled) |

1.2 Shared Packages (13 packages)

| Package | Purpose |
| --- | --- |
| @bytelyst/errors | Typed HTTP errors (400-429) |
| @bytelyst/cosmos | Cosmos DB client singleton + container registry |
| @bytelyst/config | Zod env loader, product identity, AKV resolver |
| @bytelyst/auth | JWT utilities, auth middleware, password hashing |
| @bytelyst/api-client | Fetch wrapper with auth token injection |
| @bytelyst/fastify-core | createServiceApp() factory + startService() |
| @bytelyst/react-auth | React auth context factory |
| @bytelyst/logger | Structured logging (pino-based) |
| @bytelyst/testing | Shared test mocks, Fastify inject helpers |
| @bytelyst/blob | Azure Blob Storage client + SAS helpers |
| @bytelyst/extraction | Extraction client + shared types |
| @bytelyst/monitoring | Health-check utilities |
| @bytelyst/design-tokens | Cross-platform token generator (JSON → CSS/TS/Kotlin/Swift) |

1.3 Services

| Service | Port | Description |
| --- | --- | --- |
| platform-service | 4003 | Consolidated Fastify service (25 modules, 621 tests) |
| extraction-service | 4005 | LangExtract text extraction + Python sidecar |
| monitoring | 4004 | Health-check aggregator (all services) |

1.4 Dashboards

| Dashboard | Port | Pages |
| --- | --- | --- |
| admin-dashboard-web | 3001 | ~25 pages: users, billing, flags, ops, telemetry, secrets, etc. |
| user-dashboard-web | 3002 | User portal: subscription, usage, settings |
| tracker-dashboard-web | 3003 | Public roadmap, issue tracker, community voting |

1.5 Infrastructure Already In Place

| Component | Status | Notes |
| --- | --- | --- |
| Health checks | | Per-service /health + aggregated monitoring script |
| Structured logging | | Pino (Fastify) + structlog (Python) |
| Log aggregation | | Loki + Grafana (Docker Compose) |
| Reverse proxy | | Traefik (Docker Compose) |
| Secret management | | Azure Key Vault + admin CRUD UI at /ops/secrets |
| Feature flags | | FNV-1a hash, percentage rollout, admin UI |
| Client telemetry | | All platforms instrumented, admin Client Logs page |
| Rate limiting | | In-memory sliding window + configurable rules per product |
| Outbound webhooks | ⚠️ Partial | Fire-and-forget POST for 3 events (lib/webhooks.ts); no subscription model, no retry, no HMAC signing |
| Kill switch | | Per-product, checked by all clients via /settings/kill-switch |
| Audit logging | | Records admin actions, queryable from admin dashboard |
| Blob storage | | 6 containers (audio, transcripts, attachments, avatars, releases, backups), SAS tokens, admin endpoints |
| Swagger / OpenAPI | ⚠️ Partial | createServiceApp() passes swagger config; Fastify plugin wired but Zod schemas not fully connected to route definitions via type provider |
| Prometheus metrics | ⚠️ Partial | metrics: true in createServiceApp(); basic request metrics exposed, but no custom business metrics and no Grafana dashboards for them |
| Product registry | | Multi-product with full status lifecycle (draft → pre_launch → beta → active → sunset → disabled), prelaunch config, custom fields |
| Admin doc browser | | /docs page with markdown viewer, search, and AI chat; browses repo documentation |

2. Gap Analysis — Missing Components

P0 — Foundational

These are blocking features that nearly every production app needs. Without them, critical operational workflows are manual or impossible.


2.1 Scheduled Jobs / Background Task Runner

Why: No way to run recurring work today. Trial expirations, subscription renewals, usage quota resets, stale data cleanup, digest emails, and report generation all require a scheduler.

Current state: Zero. All logic is request-driven (HTTP request → response).

Proposed design:

platform-service/src/modules/jobs/
├── types.ts         — JobDefinition, JobRun, JobSchedule schemas
├── registry.ts      — Job registry (register named jobs with cron expressions)
├── runner.ts        — Tick loop: evaluate cron, run due jobs, record outcomes
├── repository.ts    — Cosmos: job_definitions, job_runs containers
└── routes.ts        — Admin: list jobs, trigger manually, view run history, pause/resume

Built-in jobs to ship on day 1:

| Job | Schedule | Description |
| --- | --- | --- |
| trial-expiration-check | Every hour | Find subscriptions with status=trialing past currentPeriodEnd, transition to expired or active |
| usage-quota-reset | Daily at midnight UTC | Reset daily/monthly counters in usage_daily container |
| stale-session-cleanup | Every 6 hours | Remove expired refresh tokens and inactive sessions |
| telemetry-ttl-sweep | Daily at 3am UTC | Delete telemetry events past retention TTL (Cosmos TTL is best-effort) |
| waitlist-reminder | Weekly | Identify stale waitlist entries, mark for follow-up |
| license-expiry-check | Daily | Warn users whose licenses expire within 7 days |

Options for the runner:

  • In-process tick loop (simplest): setInterval in platform-service, with leader election via Cosmos lease
  • Azure Functions timer triggers (serverless): Lower cost, built-in cron, but adds deployment complexity
  • BullMQ + Redis (heavy): Best for high-throughput, but adds a Redis dependency

Recommendation: Start with in-process tick loop + Cosmos lease for leader election (avoids Redis). Migrate to Azure Functions if job volume grows.
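
The recommended approach could look roughly like the sketch below. Everything in it is hypothetical: `LeaseStore`, `MemoryLeaseStore`, `tryAcquireLease`, and `dueJobs` are illustrative names, the in-memory store stands in for a Cosmos document replaced under an ETag precondition, and a fixed interval stands in for real cron evaluation.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the platform's API.
interface Lease {
  owner: string;
  expiresAt: number; // epoch milliseconds
}

interface LeaseStore {
  read(): Lease | null;
  // Succeeds only if the stored lease is still the one the caller read.
  compareAndSet(expected: Lease | null, next: Lease): boolean;
}

// In-memory stand-in for a Cosmos document replaced with an ETag precondition.
class MemoryLeaseStore implements LeaseStore {
  private lease: Lease | null = null;
  read(): Lease | null {
    return this.lease;
  }
  compareAndSet(expected: Lease | null, next: Lease): boolean {
    if (this.lease !== expected) return false;
    this.lease = next;
    return true;
  }
}

// An instance becomes (or stays) leader if the lease is free, expired, or already its own.
function tryAcquireLease(store: LeaseStore, owner: string, now: number, ttlMs: number): boolean {
  const current = store.read();
  const available = current === null || current.expiresAt <= now || current.owner === owner;
  if (!available) return false;
  return store.compareAndSet(current, { owner, expiresAt: now + ttlMs });
}

interface ScheduledJob {
  name: string;
  intervalMs: number; // simplification: fixed interval instead of a cron expression
  nextRunAt: number;
}

// Returns the jobs that are due and advances their schedule before they run,
// so a job that throws is not picked up again on every tick.
function dueJobs(jobs: ScheduledJob[], now: number): ScheduledJob[] {
  const due = jobs.filter((job) => job.nextRunAt <= now);
  for (const job of due) job.nextRunAt = now + job.intervalMs;
  return due;
}
```

Each instance would call `tryAcquireLease` from its `setInterval` callback and process `dueJobs` only when it returns true. The lease TTL should be longer than the tick interval so the current leader renews before the lease lapses.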

Admin UI:

  • /ops/jobs page: list all registered jobs, last run status, next scheduled run
  • Manual trigger button per job
  • Run history table with duration, outcome, error details
  • Pause/resume toggle per job

Cosmos containers:

  • job_definitions (pk: /productId) — name, cron, enabled, lastRunAt, nextRunAt
  • job_runs (pk: /productId:jobName) — runId, startedAt, completedAt, status, error, metrics

2.2 Transactional Email & Push Delivery

Why: The notifications module manages device registration and preferences, but has no delivery mechanism. Notifications are database records with no way to reach users.

Current state: Device registration + preference management only. No email, no push, no SMS.

Proposed design:

platform-service/src/modules/delivery/
├── types.ts         — DeliveryRequest, DeliveryLog, ChannelConfig schemas
├── channels/
│   ├── email.ts     — SendGrid/Postmark adapter
│   ├── push-apns.ts — Apple Push Notification Service
│   ├── push-fcm.ts  — Firebase Cloud Messaging
│   └── sms.ts       — Twilio/Azure Communication Services (future)
├── renderer.ts      — Template rendering (Handlebars for email bodies)
├── repository.ts    — delivery_log + email_templates containers
├── dispatcher.ts    — Route delivery request to correct channel(s) based on prefs
└── routes.ts        — Admin: send test, view delivery log, manage templates

Email templates to ship on day 1:

| Template | Trigger | Description |
| --- | --- | --- |
| welcome | auth.register | Welcome email with getting-started guide |
| trial-expiring | jobs.trial-expiration-check (7d warning) | "Your trial ends in 7 days" |
| trial-expired | jobs.trial-expiration-check | "Your trial has ended — upgrade to continue" |
| password-reset | Future: /auth/forgot-password | One-time reset link |
| invitation | invitations.create | "You've been invited to join" |
| waitlist-accepted | waitlist.invite | "You're in! Here's your access" |
| payment-failed | stripe.invoice.payment_failed | "We couldn't charge your card" |
| license-expiring | jobs.license-expiry-check | "Your license expires in 7 days" |

Push notification types:

| Type | Channel | Description |
| --- | --- | --- |
| dictation_reminder | APNs + FCM | "Haven't dictated today — keep your streak!" |
| feature_announcement | APNs + FCM | Admin-triggered announcement |
| subscription_change | APNs + FCM | Plan upgraded/downgraded/expired |

Cosmos container:

  • delivery_log (pk: /productId:channel:yyyyMM) — id, userId, channel, template, status (sent/failed/bounced), sentAt, error
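
The composite partition key above buckets log entries by month, so a date-range query has to enumerate one key per month. A minimal sketch of that, assuming the key is a plain colon-joined string and that months are taken in UTC (both are assumptions; `deliveryLogPartitionKey` and `partitionKeysForRange` are hypothetical helper names):

```typescript
// Formats a date as yyyyMM in UTC.
function monthBucket(date: Date): string {
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  return `${year}${month}`;
}

// Builds the synthetic partition key value for one delivery log entry.
function deliveryLogPartitionKey(productId: string, channel: string, sentAt: Date): string {
  return `${productId}:${channel}:${monthBucket(sentAt)}`;
}

// Lists every partition key a query must touch to cover [from, to], inclusive.
function partitionKeysForRange(productId: string, channel: string, from: Date, to: Date): string[] {
  const keys: string[] = [];
  let year = from.getUTCFullYear();
  let month = from.getUTCMonth();
  const endYear = to.getUTCFullYear();
  const endMonth = to.getUTCMonth();
  while (year < endYear || (year === endYear && month <= endMonth)) {
    keys.push(deliveryLogPartitionKey(productId, channel, new Date(Date.UTC(year, month, 1))));
    month += 1;
    if (month === 12) {
      month = 0;
      year += 1;
    }
  }
  return keys;
}
```

The admin delivery log filter (channel plus date range) would then issue one single-partition query per returned key instead of a cross-partition scan.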

Admin UI:

  • /ops/delivery page: delivery log with filters (channel, status, template, date range)
  • Template management: list, preview, edit (future: visual editor)
  • "Send test" button for each template
  • Delivery stats: sent/failed/bounced/opened (with SendGrid webhook integration)

2.3 Outbound Webhook Subscriptions

Why: The current lib/webhooks.ts sends fire-and-forget POSTs to URLs read from environment variables, with no retry, no signing, and no subscriber management. External integrations (Zapier, Slack, custom) need a proper webhook subscription system.

Current state: 3 hardcoded webhook dispatchers (invitation redeemed, referral status changed, waitlist joined). No retry. No HMAC signing. No subscription management.

Proposed design:

platform-service/src/modules/webhooks/
├── types.ts         — WebhookSubscription, WebhookDelivery, WebhookEvent schemas
├── repository.ts    — Cosmos: webhook_subscriptions, webhook_deliveries containers
├── dispatcher.ts    — Match event → subscriptions, queue delivery, HMAC-SHA256 sign
├── delivery.ts      — HTTP POST with exponential backoff retry (3 attempts)
└── routes.ts        — Admin CRUD for subscriptions + delivery log

Event catalog (subscribe to any combination):

| Event | Payload | Source |
| --- | --- | --- |
| user.created | { userId, email, plan } | auth.register, auth.sso |
| user.deleted | { userId } | Admin: DELETE /auth/users/:id |
| subscription.created | { subscriptionId, userId, plan, status } | Registration hook |
| subscription.changed | { subscriptionId, oldPlan, newPlan, status } | Stripe webhook |
| subscription.canceled | { subscriptionId, userId, reason } | User action / Stripe |
| payment.succeeded | { invoiceId, amount, userId } | Stripe webhook |
| payment.failed | { invoiceId, amount, userId, retryCount } | Stripe webhook |
| invitation.redeemed | { invitationId, userId } | Invitation module |
| referral.completed | { referralId, referrerId, referredId } | Referral module |
| waitlist.joined | { email, position } | Waitlist module |
| flag.toggled | { flagId, enabled, percentage } | Flags module |
| license.activated | { licenseId, userId, deviceId } | License module |
| license.expired | { licenseId, userId } | Jobs: license-expiry-check |

Security:

  • Every delivery signed with X-Webhook-Signature: sha256=<HMAC> using per-subscription secret
  • Subscription secret generated at creation time, displayed once, rotatable
  • Replay protection: X-Webhook-Timestamp header, reject if > 5 min old
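
The signing and replay rules above can be sketched with Node's built-in `crypto` module. The signed string format (`timestamp.body`), the millisecond timestamp unit, and the function names are assumptions for illustration; the roadmap only fixes the header names, the `sha256=` prefix, and the 5-minute window.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const MAX_AGE_MS = 5 * 60 * 1000; // reject deliveries more than 5 minutes old

// Assumption: the signature covers "<timestamp>.<raw body>" so the timestamp cannot be swapped.
function signPayload(secret: string, timestampMs: number, body: string): string {
  const mac = createHmac("sha256", secret).update(`${timestampMs}.${body}`).digest("hex");
  return `sha256=${mac}`;
}

// Receiver-side check: the timestamp must be fresh and the signature must match.
function verifyDelivery(
  secret: string,
  timestampMs: number,
  body: string,
  signature: string,
  nowMs: number,
): boolean {
  // Math.abs also rejects timestamps far in the future (clock skew or tampering).
  if (Math.abs(nowMs - timestampMs) > MAX_AGE_MS) return false;
  const expected = Buffer.from(signPayload(secret, timestampMs, body));
  const received = Buffer.from(signature);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The dispatcher would send the `signPayload` result as `X-Webhook-Signature` and the timestamp as `X-Webhook-Timestamp`; subscribers run the equivalent of `verifyDelivery` against the raw request body before parsing it.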

Retry policy:

  • 3 attempts with exponential backoff: 10s → 60s → 300s
  • After 3 failures: mark subscription as failing, admin notification
  • After 10 consecutive failures: auto-disable subscription
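
One possible reading of this policy, as a sketch: the three delays are the waits before retry 1, 2, and 3, and subscription health is derived from a count of consecutive failed deliveries. Both interpretations are assumptions, as are the names `retryDelayMs` and `healthAfterFailures`.

```typescript
const RETRY_DELAYS_MS: readonly number[] = [10_000, 60_000, 300_000];
const FAILING_AFTER = 3; // consecutive failures before the subscription is marked failing
const AUTO_DISABLE_AFTER = 10; // consecutive failures before auto-disable

type SubscriptionHealth = "healthy" | "failing" | "disabled";

// retryNumber is 1-based; null means the retry budget is spent.
function retryDelayMs(retryNumber: number): number | null {
  return RETRY_DELAYS_MS[retryNumber - 1] ?? null;
}

// Maps the consecutive-failure counter (failureCount on the subscription) to a health state.
function healthAfterFailures(consecutiveFailures: number): SubscriptionHealth {
  if (consecutiveFailures >= AUTO_DISABLE_AFTER) return "disabled";
  if (consecutiveFailures >= FAILING_AFTER) return "failing";
  return "healthy";
}
```

A successful delivery would reset the counter to zero, which is what makes the failures "consecutive".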

Admin UI:

  • /ops/webhooks page: list subscriptions, create/edit/delete, test delivery
  • Delivery log: status (success/failed/retrying), response code, duration, payload preview
  • Per-subscription health indicator (green/yellow/red based on recent success rate)

Cosmos containers:

  • webhook_subscriptions (pk: /productId) — id, url, secret, events[], enabled, failureCount, lastDeliveryAt
  • webhook_deliveries (pk: /subscriptionId:yyyyMM) — id, event, status, attempts[], responseCode, duration

2.4 Async Event Bus / Internal Pub-Sub

Why: Today everything is synchronous request-response. As the platform grows, many operations should be fire-and-forget: audit log writes, webhook delivery, email sending, telemetry cluster updates, usage tracking. Without decoupling, any slow downstream operation blocks the API response.

Current state: Some operations are already fire-and-forget, but their promise rejections go unhandled (e.g., telemetry cluster updates). There is no formal event bus.

Proposed design:

packages/events/
├── src/
│   ├── index.ts     — EventBus class, typed event definitions
│   ├── types.ts     — PlatformEvent union type, EventHandler interface
│   └── memory.ts    — In-memory implementation (default)

Event flow:

API route handler
  → bus.emit('user.created', { userId, email, plan })
    → [handler] audit.record()
    → [handler] webhook.dispatch()
    → [handler] email.sendWelcome()
    → [handler] analytics.track()

Implementation options:

  • Phase 1: In-memory EventEmitter wrapper with typed events (zero dependencies)
  • Phase 2: Azure Service Bus adapter for cross-service events
  • Phase 3: Azure Event Grid for external consumer webhooks

Typed event definitions (Zod):

const PlatformEvents = {
  'user.created': z.object({ userId: z.string(), email: z.string(), plan: z.string() }),
  'user.deleted': z.object({ userId: z.string() }),
  'subscription.changed': z.object({
    subscriptionId: z.string(),
    oldPlan: z.string(),
    newPlan: z.string(),
  }),
  'payment.failed': z.object({ invoiceId: z.string(), userId: z.string() }),
  // ... all events from webhook catalog
} as const;
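
A Phase 1 bus might look like the following sketch. It is dependency-free and uses a plain TypeScript event map instead of the Zod schemas above to stay self-contained. The `EventBus` class shape, the `onError` callback, and the `PlatformEventMap` name are assumptions. The main point is that a failing handler, sync or async, is reported rather than left as an unhandled rejection and does not stop the other handlers.

```typescript
// Illustrative subset of the event catalog, as plain types.
type PlatformEventMap = {
  "user.created": { userId: string; email: string; plan: string };
  "user.deleted": { userId: string };
};

type Handler<T> = (payload: T) => void | Promise<void>;

class EventBus<Events extends Record<string, unknown>> {
  private readonly handlers = new Map<string, Handler<unknown>[]>();
  private readonly onError: (event: string, error: unknown) => void;

  constructor(onError: (event: string, error: unknown) => void) {
    this.onError = onError;
  }

  on<K extends keyof Events & string>(event: K, handler: Handler<Events[K]>): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler as Handler<unknown>);
    this.handlers.set(event, list);
  }

  // Fire-and-forget: returns immediately and never throws into the route handler.
  emit<K extends keyof Events & string>(event: K, payload: Events[K]): void {
    for (const handler of this.handlers.get(event) ?? []) {
      try {
        const result = handler(payload);
        if (result instanceof Promise) {
          result.catch((error) => this.onError(event, error));
        }
      } catch (error) {
        this.onError(event, error);
      }
    }
  }
}
```

In a route, `bus.emit('user.created', { userId, email, plan })` would replace the inline audit, webhook, and email calls; `onError` is the natural place to log through the request or service logger.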

Migration from existing lib/webhooks.ts:

  • Existing dispatchInvitationRedeemed(), dispatchReferralStatusChanged(), dispatchWaitlistJoined() become event bus subscribers
  • Phase 1: Register existing webhooks.ts functions as handlers on the bus
  • Phase 2: Replace inline dispatch calls in routes with bus.emit()
  • Phase 3: Remove lib/webhooks.ts once all callers migrated

Benefits:

  • Audit logging becomes a subscriber, not inline code
  • Webhook delivery becomes a subscriber, not inline code
  • Email sending becomes a subscriber, not inline code
  • New features can subscribe to events without modifying existing modules

2.5 Missing Auth Flows — Password Reset & Email Verification

Why: The auth module has login, register, SSO, and refresh — but no password reset and no email verification. These are table-stakes for any production auth system.

Current state: If a user forgets their password, there is no recovery path. Registration accepts any email without verification.

Proposed additions to auth module:

Password reset flow:

  1. POST /auth/forgot-password — accepts { email, productId }, generates a time-limited reset token (UUID), stores hash in password_reset_tokens container, sends email with reset link (via delivery module §2.2)
  2. POST /auth/reset-password — accepts { token, newPassword }, validates token, updates passwordHash, invalidates token, optionally revokes all sessions (§2.7)

Email verification flow:

  1. On register: generate verification token, store in email_verifications container, send email
  2. POST /auth/verify-email — accepts { token }, marks user email as verified
  3. POST /auth/resend-verification — rate-limited, re-sends verification email
  4. Add emailVerified: boolean field to UserDoc

Reset token document:

interface PasswordResetToken {
  id: string; // UUID
  productId: string;
  userId: string;
  tokenHash: string; // SHA-256 hash of the token (raw token sent via email)
  expiresAt: string; // 1 hour from creation
  usedAt?: string;
  createdAt: string;
}

Security considerations:

  • Store hash of token, not raw token (same pattern as API tokens)
  • Tokens expire in 1 hour
  • Rate limit: 3 reset requests per email per hour
  • After successful reset, invalidate all existing sessions
  • Log all reset attempts to audit
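
The hash-at-rest, one-hour expiry, and single-use rules can be sketched as below, using Node's built-in `crypto`. The `PasswordResetToken` shape follows the document above; `issueResetToken` and `redeemResetToken` are hypothetical names, and persistence, rate limiting, and audit logging are left out.

```typescript
import { createHash, randomUUID } from "node:crypto";

const RESET_TOKEN_TTL_MS = 60 * 60 * 1000; // tokens expire in 1 hour

interface PasswordResetToken {
  id: string;
  productId: string;
  userId: string;
  tokenHash: string; // SHA-256 of the raw token; the raw token is only ever emailed
  expiresAt: string;
  usedAt?: string;
  createdAt: string;
}

function hashToken(rawToken: string): string {
  return createHash("sha256").update(rawToken).digest("hex");
}

// Returns the raw token (for the email link) and the document to store.
function issueResetToken(
  productId: string,
  userId: string,
  now: Date,
): { rawToken: string; doc: PasswordResetToken } {
  const rawToken = randomUUID();
  const doc: PasswordResetToken = {
    id: randomUUID(),
    productId,
    userId,
    tokenHash: hashToken(rawToken),
    expiresAt: new Date(now.getTime() + RESET_TOKEN_TTL_MS).toISOString(),
    createdAt: now.toISOString(),
  };
  return { rawToken, doc };
}

// Marks the token used and returns true only for an unused, unexpired, matching token.
function redeemResetToken(doc: PasswordResetToken, rawToken: string, now: Date): boolean {
  if (doc.usedAt !== undefined) return false;
  if (now.getTime() >= Date.parse(doc.expiresAt)) return false;
  if (hashToken(rawToken) !== doc.tokenHash) return false;
  doc.usedAt = now.toISOString();
  return true;
}
```

In the real endpoint the `usedAt` update would need to be a conditional write so two concurrent requests cannot both redeem the same token.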

Cosmos container:

  • password_reset_tokens (pk: /productId) — short-lived, TTL 24h auto-expiry

Dependency: Requires email delivery (§2.2) to send reset links and verification emails. The endpoints can ship first: in dev/testing, log the generated URLs with req.log.info (never console.log) instead of emailing them.


2.6 Public Status Page

Why: Users and admins need a single place to check if services are operational. The health-check script exists but has no user-facing output.

Current state: monitoring/health-check.ts polls services and prints to stdout. No persistent status, no incident history, no public URL.

Proposed design:

Option A — Self-hosted (minimal):

platform-service/src/modules/status/
├── types.ts         — ServiceStatus, Incident, MaintenanceWindow schemas
├── repository.ts    — Cosmos: service_status, incidents containers
├── poller.ts        — Periodic health poll (reuses @bytelyst/monitoring)
└── routes.ts        — Public: GET /public/status, GET /public/status/history

Option B — External (Instatus, Statuspage, or Upptime):

  • Upptime (GitHub-based, free, open-source) — runs as a GitHub Action, publishes to GitHub Pages
  • Better for public credibility (hosted on a separate domain)

Recommendation: Option A for internal/admin use, Option B for public-facing.

Status page data model:

| Field | Type | Description |
| --- | --- | --- |
| services | array | Current status per service (operational/degraded/down) |
| incidents | array | Active and past incidents with timeline |
| maintenanceWindows | array | Scheduled maintenance with start/end times |
| overallStatus | enum | operational / degraded / major_outage |
| lastCheckedAt | ISO string | When the poller last ran |

Admin UI:

  • /ops/status page (or extend existing Mission Control /ops): service health cards with history sparklines
  • Incident management: create/update/resolve incidents with public-facing messages
  • Maintenance scheduling: create windows with auto-banners

P1 — Operational Maturity

These components improve reliability, debuggability, and operational efficiency. Not launch-blocking, but critical for a team running production services.


2.7 Session Management & Active Devices

Why: Licenses track deviceIds but there's no concept of active sessions. Users can't see where they're logged in. Admins can't force-revoke a compromised session. "Sign out all devices" is impossible.

Current state: JWT tokens with expiry. No session tracking. No revocation list. Refresh tokens are stateless.

Proposed design:

platform-service/src/modules/sessions/
├── types.ts         — SessionDoc, CreateSessionInput schemas
├── repository.ts    — Cosmos: sessions container (pk: /userId)
├── middleware.ts    — Session validation (check revocation on each request)
└── routes.ts        — User: list my sessions, revoke one, revoke all
                     — Admin: list user sessions, force-revoke

Session document:

interface SessionDoc {
  id: string; // session ID (embedded in JWT)
  productId: string;
  userId: string;
  deviceId?: string; // linked to license device
  platform: string; // ios, android, desktop, web
  ipAddress: string;
  userAgent: string;
  lastActiveAt: string;
  createdAt: string;
  revokedAt?: string;
  expiresAt: string;
}

Endpoints:

  • GET /sessions — list my active sessions
  • DELETE /sessions/:id — revoke specific session
  • DELETE /sessions — revoke all sessions (sign out everywhere)
  • GET /sessions/user/:userId — admin: list user's sessions
  • DELETE /sessions/user/:userId — admin: force-revoke all

Integration: Refresh token endpoint creates a session. Auth middleware checks session isn't revoked (Cosmos point-read by session ID, cached in-memory with short TTL).
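
The cached revocation check could be sketched as follows. `TtlCache`, `isSessionActive`, and `checkSession` are hypothetical names, and the synchronous `load` callback is a simplification standing in for the asynchronous Cosmos point-read.

```typescript
interface SessionState {
  revokedAt?: string;
  expiresAt: string;
}

// Tiny in-memory cache whose entries expire ttlMs after they are written.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now: number): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}

function isSessionActive(session: SessionState | null, now: number): boolean {
  if (session === null) return false;
  if (session.revokedAt !== undefined) return false;
  return Date.parse(session.expiresAt) > now;
}

// Consults the cache first; on a miss, loads the session and caches the verdict.
function checkSession(
  sessionId: string,
  cache: TtlCache<boolean>,
  load: (id: string) => SessionState | null, // stand-in for the Cosmos point-read
  now: number,
): boolean {
  const cached = cache.get(sessionId, now);
  if (cached !== undefined) return cached;
  const active = isSessionActive(load(sessionId), now);
  cache.set(sessionId, active, now);
  return active;
}
```

The trade-off is visible in the cache TTL: a revoked session can stay usable on an instance for up to one TTL, so the value should be short (seconds, not minutes) or the revoke endpoints should also evict the local entry.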


2.8 Database Migration & Schema Evolution Tracker

Why: Cosmos DB is schemaless, but breaking changes still happen: new required fields, partition key changes, index policy updates, container renames. Without tracking, deployments are error-prone and rollbacks are impossible.

Current state: No migration tracking. Schema changes are applied ad-hoc.

Proposed design:

platform-service/src/migrations/
├── runner.ts        — Run pending migrations on startup (idempotent)
├── registry.ts      — List of migration files, ordered by version
└── migrations/
    ├── 001_add_productId_to_legacy_users.ts
    ├── 002_create_telemetry_containers.ts
    └── ...

Migration document (in migrations container):

interface MigrationDoc {
  id: string; // "001_add_productId_to_legacy_users"
  productId: string; // "platform"
  version: number;
  description: string;
  appliedAt: string;
  durationMs: number;
  status: 'applied' | 'failed' | 'rolled_back';
  error?: string;
}

Behavior:

  • On service startup, runner checks migrations container for applied versions
  • Runs any unapplied migrations in order
  • Each migration is idempotent (safe to re-run)
  • Failed migrations are recorded but don't block startup (logged as warnings)
  • Admin UI: /ops/migrations page showing applied/pending/failed
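
A sketch of the runner behaviour described above. `pendingMigrations` and `runPending` are hypothetical names, the `save` callback stands in for a write to the migrations container, and stopping at the first failure is an assumption (the roadmap only says failures are recorded and do not block startup).

```typescript
interface Migration {
  id: string;
  version: number;
  up: () => Promise<void>; // must be idempotent
}

interface MigrationRecord {
  id: string;
  version: number;
  status: "applied" | "failed" | "rolled_back";
  appliedAt: string;
  durationMs: number;
  error?: string;
}

// Everything without an "applied" record is pending (including earlier failures), oldest first.
function pendingMigrations(all: Migration[], records: MigrationRecord[]): Migration[] {
  const applied = new Set(records.filter((r) => r.status === "applied").map((r) => r.id));
  return all.filter((m) => !applied.has(m.id)).sort((a, b) => a.version - b.version);
}

// Runs pending migrations in order and records each outcome. Never throws for a
// failing migration body, so a failure is logged without blocking startup.
async function runPending(
  all: Migration[],
  records: MigrationRecord[],
  save: (record: MigrationRecord) => Promise<void>,
): Promise<MigrationRecord[]> {
  const results: MigrationRecord[] = [];
  for (const migration of pendingMigrations(all, records)) {
    const startedAt = Date.now();
    const base = { id: migration.id, version: migration.version, appliedAt: new Date(startedAt).toISOString() };
    let record: MigrationRecord;
    try {
      await migration.up();
      record = { ...base, status: "applied", durationMs: Date.now() - startedAt };
    } catch (error) {
      record = { ...base, status: "failed", durationMs: Date.now() - startedAt, error: String(error) };
    }
    await save(record);
    results.push(record);
    // Assumption: later migrations may depend on this one, so stop at the first failure.
    if (record.status === "failed") break;
  }
  return results;
}
```

With several service instances starting at once, the runner would also need the same kind of lease used for scheduled jobs so that only one instance applies migrations.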

2.9 Data Export & Bulk Operations

Why: Admins regularly need: export users as CSV, export audit logs, bulk status updates, bulk license revocation. Today these require direct database queries.

Current state: Waitlist has a CSV export endpoint. Nothing else supports bulk operations.

Proposed design:

platform-service/src/modules/exports/
├── types.ts         — ExportJob, ExportFormat schemas
├── repository.ts    — Cosmos: export_jobs container
├── workers/
│   ├── users.ts     — Export users as CSV/JSON
│   ├── audit.ts     — Export audit log
│   ├── telemetry.ts — Export telemetry events
│   ├── usage.ts     — Export usage data
│   └── subscriptions.ts — Export subscriptions
└── routes.ts        — POST /exports (start), GET /exports (list), GET /exports/:id/download

Flow:

  1. Admin sends POST /api/exports with body { type: 'users', format: 'csv', filters: { plan: 'free' } }
  2. Background job runs query, writes result to blob storage (via existing blob module)
  3. Job status updates: pending → processing → ready / failed
  4. Admin downloads from signed blob URL (SAS token via @bytelyst/blob)
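
For step 2, the CSV side of an export worker needs correct field quoting, since user-supplied values such as names can contain commas, quotes, or newlines. A minimal sketch following the usual RFC 4180 conventions; `escapeCsvField` and `toCsv` are hypothetical helpers, not part of the existing waitlist export.

```typescript
// Quotes a field only when needed; embedded double quotes are doubled.
function escapeCsvField(value: unknown): string {
  const text = value === null || value === undefined ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serializes rows with a header line, one record per CRLF-terminated line.
function toCsv(rows: Record<string, unknown>[], columns: string[]): string {
  const lines = [columns.map(escapeCsvField).join(",")];
  for (const row of rows) {
    lines.push(columns.map((column) => escapeCsvField(row[column])).join(","));
  }
  return lines.join("\r\n") + "\r\n";
}
```

For large exports the worker would stream rows to the blob in pages rather than build one string, but the per-field escaping stays the same.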

Dependencies: blob module (existing) for storage, jobs module (§2.1) for auto-cleanup of expired exports.

Supported exports:

  • Users (with filters: plan, status, date range)
  • Audit log (with filters: action, userId, date range)
  • Telemetry events (with filters: platform, eventType, date range)
  • Usage records (with filters: userId, date range)
  • Subscriptions (with filters: plan, status)
  • Licenses (with filters: status, plan)

Admin UI:

  • /ops/exports page: create new export, list past exports, download links
  • Progress indicator for running exports
  • Auto-cleanup: delete export blobs after 7 days

2.10 Maintenance Mode & Graceful Degradation

Why: Kill switch is binary (on/off per product). Need nuanced control: read-only mode, specific features disabled, custom banner messages, admin bypass, scheduled windows.

Current state: settings/kill-switch endpoint returns boolean per product. Clients check and fully disable themselves.

Proposed design:

Extend the existing settings module:

interface MaintenanceConfig {
  mode: 'off' | 'read_only' | 'maintenance' | 'emergency';
  message: string; // Shown to users
  adminMessage?: string; // Shown to admins
  bypassRoles: string[]; // Roles that can bypass (e.g., ['admin', 'super_admin'])
  bypassIPs: string[]; // IP addresses that bypass
  scheduledStart?: string; // ISO — for planned maintenance
  scheduledEnd?: string;
  affectedServices: string[]; // ['api', 'dictation', 'extraction'] or ['*']
  updatedAt: string;
  updatedBy: string;
}

Modes:

  • off — Normal operation
  • read_only — GET requests allowed, writes blocked (for database maintenance)
  • maintenance — All requests return 503 with message (except admin bypass)
  • emergency — Kill switch + maintenance message + all clients show error
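
The four modes could be enforced by a single request gate, sketched below. Several details are assumptions rather than part of the design: blocked requests always get a 503, bypass rules apply to read_only and maintenance but not to emergency, GET/HEAD/OPTIONS count as reads, and bypass IPs are matched exactly. Scheduled windows and `affectedServices` are ignored here.

```typescript
type MaintenanceMode = "off" | "read_only" | "maintenance" | "emergency";

// Subset of MaintenanceConfig needed for the decision.
interface GateConfig {
  mode: MaintenanceMode;
  message: string;
  bypassRoles: string[];
  bypassIPs: string[];
}

interface GateRequest {
  method: string;
  ip: string;
  role?: string;
}

type GateDecision = { allowed: true } | { allowed: false; status: 503; message: string };

const READ_METHODS = new Set(["GET", "HEAD", "OPTIONS"]);

function evaluateMaintenance(config: GateConfig, request: GateRequest): GateDecision {
  if (config.mode === "off") return { allowed: true };

  const bypassed =
    (request.role !== undefined && config.bypassRoles.includes(request.role)) ||
    config.bypassIPs.includes(request.ip);
  // Assumption: nobody bypasses emergency mode.
  if (bypassed && config.mode !== "emergency") return { allowed: true };

  if (config.mode === "read_only" && READ_METHODS.has(request.method.toUpperCase())) {
    return { allowed: true };
  }
  return { allowed: false, status: 503, message: config.message };
}
```

In Fastify this would naturally sit in an early request hook, with the config read from the cached settings document rather than fetched per request.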

Endpoints:

  • GET /settings/maintenance — Public: check current mode + message
  • PUT /settings/maintenance — Admin: update mode, message, bypass rules
  • GET /settings/maintenance/schedule — Upcoming maintenance windows

Client integration:

  • Clients poll /settings/maintenance alongside kill-switch check
  • If mode !== 'off', show banner with message
  • If mode === 'read_only', disable write operations and show a user-facing explanation

Admin UI:

  • Extend existing Settings page or add /ops/maintenance
  • Mode toggle (off/read-only/maintenance/emergency)
  • Message editor with preview
  • Schedule builder with start/end date pickers
  • Bypass IP whitelist management

Storage: Maintenance config is a single document per product in the existing settings container (field: maintenanceConfig). No new Cosmos container needed.


2.11 Rate Limit Dashboard & IP Allow/Deny Lists

Why: The ratelimit module exists, but admins have no visibility into who is being rate-limited and no way to allowlist VIP users or denylist abusive IPs.

Current state: In-memory sliding window rate limiter with configurable rules. No persistence, no admin visibility.

Proposed design:

Extend ratelimit module:

interface RateLimitEntry {
  key: string; // userId or IP
  productId: string;
  currentCount: number;
  windowStart: string;
  wasLimited: boolean;
  lastLimitedAt?: string;
}

interface IPRule {
  id: string;
  productId: string;
  ip: string; // CIDR notation supported
  action: 'allow' | 'deny';
  reason: string;
  createdBy: string;
  createdAt: string;
  expiresAt?: string; // Temporary blocks
}

Additional endpoints:

  • GET /ratelimit/stats — Admin: top rate-limited keys, total 429s in last hour/day
  • GET /ratelimit/blocked — Admin: currently blocked keys
  • POST /ratelimit/ip-rules — Admin: add IP allow/deny rule
  • GET /ratelimit/ip-rules — Admin: list rules
  • DELETE /ratelimit/ip-rules/:id — Admin: remove rule

Admin UI:

  • /ops/rate-limits page: real-time rate limit stats
  • Top offenders table (most 429 responses)
  • IP rules management (allow/deny with expiry)
  • Per-user rate limit override

Cosmos container:

  • ip_rules (pk: /productId) — persistent IP allow/deny rules
  • Rate limit stats remain in-memory (ephemeral); no persistence needed for counters
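
Evaluating `IPRule` entries with CIDR support and expiry could look like the sketch below. It handles IPv4 only, and "deny wins over allow" is an assumed precedence; `ipv4ToInt`, `matchesCidr`, and `decideIp` are hypothetical names.

```typescript
interface IpRule {
  ip: string; // single address or CIDR block, IPv4 only in this sketch
  action: "allow" | "deny";
  expiresAt?: string;
}

// Parses dotted-quad IPv4 into an unsigned 32-bit number, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True if ip falls inside the block; a bare address is treated as a /32.
function matchesCidr(ip: string, cidr: string): boolean {
  const [base = "", prefix = "32"] = cidr.split("/");
  if (!/^\d{1,2}$/.test(prefix)) return false;
  const bits = Number(prefix);
  const ipValue = ipv4ToInt(ip);
  const baseValue = ipv4ToInt(base);
  if (ipValue === null || baseValue === null || bits > 32) return false;
  const blockSize = 2 ** (32 - bits);
  return Math.floor(ipValue / blockSize) === Math.floor(baseValue / blockSize);
}

// Applies unexpired matching rules; assumption: any matching deny beats any allow.
function decideIp(rules: IpRule[], ip: string, now: Date): "allow" | "deny" | "none" {
  const active = rules.filter(
    (rule) =>
      (rule.expiresAt === undefined || Date.parse(rule.expiresAt) > now.getTime()) &&
      matchesCidr(ip, rule.ip),
  );
  if (active.some((rule) => rule.action === "deny")) return "deny";
  return active.length > 0 ? "allow" : "none";
}
```

A result of "none" would fall through to the normal sliding-window limiter, "allow" would skip it, and "deny" would reject the request before any counting happens.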

P2 — Product Intelligence

These components provide deeper insight into product health, user behavior, and experiment outcomes. They transform raw data into actionable intelligence.


2.12 A/B Testing & Experiments Framework

Why: Feature flags exist but only support on/off with percentage rollout. No variant assignment, metric collection, or statistical significance calculation.

Current state: flags module with boolean flags and FNV-1a deterministic rollout.

Proposed design:

Extend flags module or create sibling experiments module:

platform-service/src/modules/experiments/
├── types.ts         — Experiment, Variant, ExperimentMetric schemas
├── repository.ts    — Cosmos: experiments container
├── assignment.ts    — Deterministic variant assignment (extend FNV-1a)
├── analysis.ts      — Statistical significance calculation
└── routes.ts        — Admin CRUD + results endpoint

Experiment document:

interface ExperimentDoc {
  id: string;
  productId: string;
  name: string;
  hypothesis: string;
  status: 'draft' | 'running' | 'paused' | 'concluded';
  variants: Variant[]; // [{id: 'control', weight: 50}, {id: 'treatment', weight: 50}]
  targetingRules: FlagTargetingRules; // Reuse from flags module (platforms, versions, percentage)
  primaryMetric: string; // e.g., 'dictation_completed_rate'
  secondaryMetrics: string[];
  startedAt?: string;
  concludedAt?: string;
  winningVariant?: string;
  sampleSize: number;
  results?: ExperimentResults;
}
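
Deterministic, weighted variant assignment can be built on a 32-bit FNV-1a hash, as sketched below. How the existing flags module keys its hash is not shown in this document, so the `experimentId:userId` key format, the integer weights, and the function names are assumptions.

```typescript
interface Variant {
  id: string;
  weight: number; // non-negative integer; weights need not sum to 100
}

// Standard 32-bit FNV-1a over the UTF-8 bytes of the input.
function fnv1a32(input: string): number {
  const bytes = new TextEncoder().encode(input);
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned
  }
  return hash >>> 0;
}

// The same user always lands in the same variant for a given experiment.
function assignVariant(experimentId: string, userId: string, variants: Variant[]): string | null {
  const total = variants.reduce((sum, variant) => sum + variant.weight, 0);
  if (total <= 0) return null;
  const bucket = fnv1a32(`${experimentId}:${userId}`) % total;
  let upperBound = 0;
  for (const variant of variants) {
    upperBound += variant.weight;
    if (bucket < upperBound) return variant.id;
  }
  return null;
}
```

Including the experiment ID in the hash key keeps assignments independent across experiments, so a user in the treatment group of one test is not systematically in the treatment group of the next.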

Admin UI:

  • /experiments page: list experiments, create new, view results
  • Results view: conversion rates per variant, confidence interval, statistical significance indicator
  • "Conclude" action: pick winner, auto-convert to feature flag

2.13 Analytics Aggregation Pipeline

Why: usage tracks raw events but there are no pre-aggregated rollups. Admin dashboard charts require expensive real-time queries. DAU/WAU/MAU, retention cohorts, and funnel analysis are impossible without rollups.

Current state: Raw usage_daily records. No aggregation.

Proposed design:

platform-service/src/modules/analytics/
├── types.ts         — MetricRollup, CohortEntry, FunnelStep schemas
├── repository.ts    — Cosmos: analytics_rollups container
├── rollup-jobs/
│   ├── dau-wau-mau.ts    — Daily/weekly/monthly active users
│   ├── retention.ts      — Cohort retention (D1, D7, D14, D30)
│   ├── funnel.ts         — Conversion funnels (signup → activate → dictate → subscribe)
│   └── feature-adoption.ts — Per-feature usage rates
└── routes.ts        — Admin: GET /analytics/dau, /retention, /funnel, /adoption

Rollup schedule (via jobs module):

  • DAU: every hour (incremental)
  • WAU/MAU: daily at 1am UTC
  • Retention cohorts: daily at 2am UTC
  • Funnels: daily at 2:30am UTC

Key metrics:

  • DAU/WAU/MAU — with breakdown by platform, plan
  • Retention cohorts — "Of users who signed up in week X, what % are active in week X+1, X+4?"
  • Conversion funnel — signup → first dictation → 5th dictation → subscription
  • Feature adoption — % of active users using each major feature
  • Revenue metrics — MRR, churn rate, ARPU, LTV (from subscriptions + Stripe data)
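
The DAU/WAU/MAU rollup reduces to counting distinct users over trailing windows. A minimal sketch, assuming one activity record per user per UTC day (roughly what usage_daily might hold; the `ActivityRecord` shape is hypothetical) and trailing windows of 1, 7, and 30 days:

```typescript
interface ActivityRecord {
  userId: string;
  day: string; // "YYYY-MM-DD", UTC
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Counts distinct users active in the windowDays ending on endDay (inclusive).
function activeUsers(records: ActivityRecord[], endDay: string, windowDays: number): number {
  const end = Date.parse(`${endDay}T00:00:00Z`);
  const start = end - (windowDays - 1) * DAY_MS;
  const users = new Set<string>();
  for (const record of records) {
    const time = Date.parse(`${record.day}T00:00:00Z`);
    if (time >= start && time <= end) users.add(record.userId);
  }
  return users.size;
}

// One rollup document per day, ready to store in analytics_rollups.
function rollupForDay(records: ActivityRecord[], day: string): { day: string; dau: number; wau: number; mau: number } {
  return {
    day,
    dau: activeUsers(records, day, 1),
    wau: activeUsers(records, day, 7),
    mau: activeUsers(records, day, 30),
  };
}
```

The admin charts then read one small rollup document per day instead of scanning raw usage records on every page load.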

Admin UI:

  • Extend dashboard home or create /analytics page
  • Charts: DAU/WAU/MAU line chart, retention heatmap, funnel bar chart, MRR trend

2.14 In-App Feedback & Support Widget

Why: Tracker handles issue tracking but there's no way for end users to submit feedback directly from the app. Bug reports with device context, NPS surveys, and feature requests should flow into the tracker automatically.

Current state: Public roadmap allows feature submissions and voting. No in-app feedback widget.

Proposed design:

platform-service/src/modules/feedback/
├── types.ts         — FeedbackEntry, FeedbackType, DeviceContext schemas
├── repository.ts    — Cosmos: feedback container (pk: /productId)
└── routes.ts        — POST /feedback (authenticated), GET /feedback (admin query)

Feedback types:

  • bug_report — with device context, screenshot URL (blob), reproduction steps
  • feature_request — auto-creates tracker item in items module
  • nps_survey — score (0-10), comment, context
  • general — free-form text

Client integration:

  • Shake-to-report (iOS/Android) or keyboard shortcut (Desktop)
  • Auto-attach: device model, OS version, app version, current screen, last 10 telemetry events
  • Screenshot capture (optional, privacy-respecting)

Admin UI:

  • /feedback page: list feedback with filters (type, platform, date range, NPS score range)
  • Quick actions: convert to tracker item, reply, dismiss
  • NPS dashboard: score distribution over time, detractor/promoter breakdown

2.15 User Impersonation / Admin Shadow Mode

Why: When a user reports a bug, admins need to see exactly what they see. Without impersonation, debugging requires asking users for screenshots and steps, which is slow and unreliable.

Current state: No impersonation capability.

Proposed design:

Endpoint:

  • POST /auth/impersonate — Admin only. Accepts { targetUserId }. Returns a scoped shadow token.

Shadow token properties:

  • Contains impersonatedBy: adminUserId claim
  • Read-only by default (no writes unless explicitly allowed)
  • Expires in 15 minutes (non-renewable)
  • All actions logged to audit with impersonatedBy field
  • Visible banner in dashboard: "You are viewing as [user name] — all actions are audited"
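
The shadow token rules above (15-minute expiry, read-only by default) can be sketched as a pair of pure functions. The claim names other than `impersonatedBy`, and both function names, are assumptions for illustration; real token signing and audit logging are out of scope here.

```typescript
// Hypothetical claim shape; only `impersonatedBy` is named in the design above.
interface ShadowClaims {
  sub: string;            // target user id
  impersonatedBy: string; // admin user id
  readOnly: boolean;
  exp: number;            // expiry, seconds since epoch
}

const SHADOW_TTL_SECONDS = 15 * 60;

function issueShadowClaims(adminUserId: string, targetUserId: string, nowMs: number): ShadowClaims {
  return {
    sub: targetUserId,
    impersonatedBy: adminUserId,
    readOnly: true,
    exp: Math.floor(nowMs / 1000) + SHADOW_TTL_SECONDS,
  };
}

const SAFE_METHODS = new Set(['GET', 'HEAD', 'OPTIONS']);

type Decision = { allowed: true } | { allowed: false; reason: string };

// Rejects expired tokens first, then any non-safe HTTP method on a read-only token.
function authorizeShadowRequest(claims: ShadowClaims, method: string, nowMs: number): Decision {
  if (nowMs / 1000 >= claims.exp) return { allowed: false, reason: 'shadow token expired' };
  if (claims.readOnly && !SAFE_METHODS.has(method.toUpperCase())) {
    return { allowed: false, reason: 'shadow token is read-only' };
  }
  return { allowed: true };
}
```

Passing the clock in as `nowMs` keeps the check testable; a non-renewable token simply means no endpoint ever extends `exp`.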

Admin UI:

  • On the user detail page (/users/:id), add "View as User" button
  • Opens user dashboard in new tab with shadow token
  • Impersonation sessions listed on /ops/audit with filter

2.16 Changelog & In-App Release Notes

Why: Users should know what changed in each release. A changelog system also serves as internal documentation and can be shown as a "What's New" modal in the app.

Current state: CHANGELOG.md exists in the repo but nothing in-app.

Proposed design:

platform-service/src/modules/changelog/
├── types.ts         — ChangelogEntry, ReleaseNote schemas
├── repository.ts    — Cosmos: changelog container (pk: /productId)
└── routes.ts        — Public: GET /changelog (paginated)
                     — Admin: CRUD changelog entries

Entry document:

interface ChangelogEntry {
  id: string;
  productId: string;
  version: string; // "1.2.0"
  title: string;
  body: string; // Markdown
  category: 'feature' | 'improvement' | 'bugfix' | 'security';
  platforms: string[]; // ['ios', 'android', 'desktop', 'web']
  publishedAt?: string;
  isDraft: boolean;
  createdBy: string;
}

Client integration:

  • App checks GET /api/changelog?since=<lastSeenVersion> on launch
  • If new entries exist, show "What's New" modal
  • User can dismiss; lastSeenVersion stored in settings
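
The "What's New" check above depends on comparing versions. A minimal sketch, assuming plain dotted numeric versions such as "1.2.0" (pre-release suffixes are not handled) and a reduced entry shape; `compareVersions` and `entriesToShow` are hypothetical names, and the same filtering could equally live server-side behind the `since` parameter.

```typescript
// Compares dotted numeric versions; returns -1, 0, or 1.
function compareVersions(a: string, b: string): number {
  const pa = a.split('.').map(Number);
  const pb = b.split('.').map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff < 0 ? -1 : 1;
  }
  return 0;
}

// Reduced view of ChangelogEntry with only the fields this check needs.
interface EntrySummary {
  version: string;
  title: string;
  isDraft: boolean;
  platforms: string[];
}

// Published entries for this platform that are newer than lastSeenVersion, newest first.
function entriesToShow(
  entries: EntrySummary[],
  lastSeenVersion: string | undefined,
  platform: string,
): EntrySummary[] {
  return entries
    .filter((e) => !e.isDraft && e.platforms.includes(platform))
    .filter((e) => lastSeenVersion === undefined || compareVersions(e.version, lastSeenVersion) > 0)
    .sort((x, y) => compareVersions(y.version, x.version));
}
```

Numeric comparison matters here: a plain string comparison would rank "1.10.0" below "1.2.0".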

Admin UI:

  • /changelog page: create/edit/publish entries with Markdown editor
  • Preview mode before publishing
  • Schedule publishing for future date

P3 — Scale & Polish

These components are important for scale, security, and developer experience, but are lower urgency.


2.17 CDN & Asset Pipeline

Why: Blob storage serves files directly from Azure. No edge caching, no image optimization, no automatic resizing for avatars/thumbnails.

Proposed approach:

  • Azure CDN or Cloudflare in front of blob storage
  • Image resize on upload (Sharp) for avatars: 64px, 128px, 256px variants
  • Cache headers: Cache-Control: public, max-age=31536000, immutable for content-addressed assets
  • Release binaries served via CDN for faster desktop app updates

2.18 API Versioning Strategy

Why: As external consumers appear (webhook integrations, third-party tools), breaking API changes need to be managed. Today all endpoints are unversioned.

Proposed approach:

  • URL prefix: /v1/api/...
  • Deprecation header: Sunset: <date> + Deprecation: true
  • Version lifecycle: current → deprecated (6 months' notice) → retired
  • OpenAPI spec generated per version
  • Fastify plugin that routes to versioned handlers
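
The lifecycle and deprecation headers above could be derived from per-version metadata. This is a sketch under assumptions: the `VersionInfo` shape and both function names are hypothetical, and the `Deprecation: true` value follows the wording in this document rather than any particular spec revision.

```typescript
type Lifecycle = 'current' | 'deprecated' | 'retired';

// Hypothetical per-version metadata.
interface VersionInfo {
  version: string;
  deprecatedAt?: Date;
  sunsetAt?: Date;
}

function lifecycleOf(v: VersionInfo, now: Date): Lifecycle {
  if (v.sunsetAt && now.getTime() >= v.sunsetAt.getTime()) return 'retired';
  if (v.deprecatedAt && now.getTime() >= v.deprecatedAt.getTime()) return 'deprecated';
  return 'current';
}

// Headers to attach to responses served by a deprecated version; empty otherwise.
function deprecationHeaders(v: VersionInfo, now: Date): Record<string, string> {
  if (lifecycleOf(v, now) !== 'deprecated') return {};
  const headers: Record<string, string> = { Deprecation: 'true' };
  // toUTCString() yields an HTTP-date style value, e.g. "Wed, 01 Jul 2026 00:00:00 GMT".
  if (v.sunsetAt) headers['Sunset'] = v.sunsetAt.toUTCString();
  return headers;
}
```

In a Fastify plugin these headers would typically be set in an `onSend` hook for routes registered under the deprecated prefix.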

2.19 OpenAPI / Auto-Generated API Docs

Why: Platform-service already passes swagger config to createServiceApp(), but Zod schemas aren't fully wired to route definitions. The admin /docs page is a markdown doc browser (not API docs). Auto-generated API docs from Zod schemas would be nearly free.

Current state: @fastify/swagger is configured with title/description but route schemas aren't connected via fastify-type-provider-zod (the Zod type provider is a community package, not published under the @fastify scope). Swagger UI may already be partially served but without route-level detail.

Proposed approach:

  • Wire fastify-type-provider-zod to connect existing Zod schemas to Fastify route definitions
  • Verify @fastify/swagger-ui is serving at /documentation on platform-service
  • Add route-level schema: { body, querystring, params, response } using existing Zod schemas
  • Export OpenAPI JSON at /documentation/json
  • Admin dashboard links to platform-service Swagger UI

2.20 Localization / i18n Service

Why: Centralized string management for all platforms. When adding a new language, change one place, not four codebases.

Proposed approach:

  • translations Cosmos container (pk: /productId:locale)
  • Admin UI: string management with translation status per locale
  • Client SDK: fetch translations on launch, cache locally
  • Fallback chain: requested locale → base locale → English
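
The fallback chain above (requested locale → base locale → English) can be sketched as a lookup over an in-memory catalog. The `Catalog` shape and function names are assumptions; the real client SDK would load the catalog from the translations container and cache it.

```typescript
// locale -> key -> translated string
type Catalog = Record<string, Record<string, string>>;

// "pt-BR" -> ["pt-BR", "pt", "en"]; duplicates are removed, so "en" -> ["en"].
function fallbackChain(locale: string): string[] {
  const base = locale.split('-')[0] ?? locale;
  return Array.from(new Set([locale, base, 'en']));
}

function translate(catalog: Catalog, locale: string, key: string): string {
  for (const candidate of fallbackChain(locale)) {
    const value = catalog[candidate]?.[key];
    if (value !== undefined) return value;
  }
  // Last resort: surface the key itself so missing strings are easy to spot.
  return key;
}
```

Returning the key on a total miss is one possible choice; throwing in development builds would surface gaps earlier.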

2.21 Full-Text Search

Why: Admin needs to search users by partial name/email. Users need to search memories/items. Cosmos SQL CONTAINS() is slow and doesn't rank results.

Proposed approach:

  • Phase 1: Cosmos DB full-text search (preview feature, no extra cost)
  • Phase 2: Azure AI Search for richer capabilities (fuzzy matching, facets, suggestions)
  • Admin UI: unified search bar across entities (users, items, audit logs)

2.22 Multi-Tenant Workspace / Org / Team Management

Why: productId scopes data per product, but within a product there's no team or organization concept. Enterprise customers need: org hierarchy, team-scoped permissions, shared brains/workspaces.

Proposed design (future):

users → belong to → organizations → have → teams → own → resources

This is a major architectural expansion. Defer until enterprise tier is validated.


2.23 Data Retention & Lifecycle Policies

Why: Telemetry has TTL. Other containers don't. Old audit logs, expired sessions, redeemed promos, and stale waitlist entries accumulate forever.

Proposed approach:

  • Admin-configurable retention policies per container
  • Scheduled job (from §2.1) runs cleanup
  • Default policies: audit (365 days), telemetry (30 days), sessions (90 days), export files (7 days)
  • Admin UI: /ops/retention page showing policies and next cleanup run
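
A cleanup job needs two numbers per policy: a TTL in seconds (the unit Cosmos DB uses for TTL) and a cutoff timestamp for containers cleaned by query. A minimal sketch using the default policies above; the keys are illustrative labels rather than confirmed container ids, and the helper names are hypothetical.

```typescript
const DAY_SECONDS = 24 * 60 * 60;

// Defaults from the list above; keys are labels, not confirmed container ids.
const DEFAULT_RETENTION_DAYS: Record<string, number> = {
  audit: 365,
  telemetry: 30,
  sessions: 90,
  export_files: 7,
};

// Admin overrides win over defaults; undefined means "no policy, keep forever".
function retentionDays(name: string, overrides: Record<string, number> = {}): number | undefined {
  return overrides[name] ?? DEFAULT_RETENTION_DAYS[name];
}

// Cosmos DB expresses TTL in seconds.
function ttlSeconds(days: number): number {
  return days * DAY_SECONDS;
}

// ISO timestamp before which documents are eligible for cleanup.
function cutoffIso(days: number, now: Date): string {
  return new Date(now.getTime() - days * DAY_SECONDS * 1000).toISOString();
}
```

Where a container-level TTL is acceptable, setting it avoids the cleanup query entirely; the cutoff form is for containers where only some documents should expire.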

2.24 Automated Backup & Point-in-Time Restore

Why: Azure Cosmos DB has continuous backup, but admin needs visibility and one-click restore capability.

Proposed approach:

  • Admin UI: /ops/backups page showing Azure backup status
  • Manual export to blob (scheduled job from §2.1)
  • Restore button: triggers Azure Cosmos point-in-time restore API
  • Cross-region replication status indicator

2.25 Billing Dunning & Payment Recovery

Why: Stripe handles retries, but the platform needs to: notify users of failed payments, offer grace periods, and eventually downgrade plans.

Proposed flow:

  1. invoice.payment_failed → send "payment failed" email (§2.2) + in-app banner
  2. After 3 failures (Stripe Smart Retries) → send "final warning" email
  3. After grace period (7 days) → downgrade to free plan + email notification
  4. All transitions logged to audit

Integration: Stripe webhook handler (existing) + email delivery (§2.2) + scheduled job (§2.1) for grace period enforcement.
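
The four-step flow above can be sketched as a small state machine that returns the side effects to perform instead of performing them. Assumptions, labeled loudly: the state names, the action strings, and the choice to start the 7-day grace period at the final warning are all illustrative; the document does not specify when the grace period begins.

```typescript
type DunningState = 'ok' | 'past_due' | 'final_warning' | 'downgraded';

interface DunningRecord {
  state: DunningState;
  failures: number;
  finalWarningAt?: number; // ms since epoch; assumed start of the grace period
}

type Action =
  | 'email:payment_failed'
  | 'banner:show'
  | 'email:final_warning'
  | 'plan:downgrade_to_free'
  | 'email:downgraded'
  | 'audit:log';

interface Transition {
  next: DunningRecord;
  actions: Action[];
}

const MAX_FAILURES = 3;
const GRACE_MS = 7 * 24 * 60 * 60 * 1000;

// Called from the invoice.payment_failed webhook handler.
function onPaymentFailed(rec: DunningRecord, nowMs: number): Transition {
  if (rec.state === 'downgraded' || rec.state === 'final_warning') {
    return { next: { ...rec, failures: rec.failures + 1 }, actions: [] };
  }
  const failures = rec.failures + 1;
  if (failures >= MAX_FAILURES) {
    return {
      next: { state: 'final_warning', failures, finalWarningAt: nowMs },
      actions: ['email:final_warning', 'audit:log'],
    };
  }
  return {
    next: { state: 'past_due', failures },
    actions: ['email:payment_failed', 'banner:show', 'audit:log'],
  };
}

// Called by the scheduled job to enforce the grace period.
function onGraceCheck(rec: DunningRecord, nowMs: number): Transition {
  if (rec.state === 'final_warning' && rec.finalWarningAt !== undefined && nowMs - rec.finalWarningAt >= GRACE_MS) {
    return {
      next: { state: 'downgraded', failures: rec.failures },
      actions: ['plan:downgrade_to_free', 'email:downgraded', 'audit:log'],
    };
  }
  return { next: rec, actions: [] };
}
```

Because both functions are pure and repeated calls in a terminal state emit no actions, replayed Stripe webhooks and overlapping job ticks do not send duplicate emails.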


3. Implementation Priority Matrix

| Phase | Components | Effort | Dependencies | Unlocks |
| --- | --- | --- | --- | --- |
| Sprint 1 | 2.1 Scheduled Jobs | M | None | Foundation for all time-based operations |
| Sprint 1 | 2.4 Event Bus | S | None | Decoupling for email, webhooks, audit |
| Sprint 2 | 2.2 Email Delivery | M | 2.4 Event Bus | User communication (welcome, trial expiry, payment failed) |
| Sprint 2 | 2.5 Password Reset + Email Verify | S | 2.2 Email Delivery | Auth completeness, table-stakes for production |
| Sprint 3 | 2.3 Webhook Subscriptions | M | 2.4 Event Bus | Third-party integrations, Zapier/Slack |
| Sprint 3 | 2.7 Session Management | S | None | Security (sign out everywhere, revocation) |
| Sprint 4 | 2.10 Maintenance Mode | S | None | Operational control during deployments |
| Sprint 4 | 2.9 Data Export | S | 2.1 Jobs (for blob cleanup) | Admin self-service, compliance |
| Sprint 5 | 2.13 Analytics Rollups | M | 2.1 Jobs (for rollup scheduling) | Dashboard charts, business metrics |
| Sprint 5 | 2.19 OpenAPI Docs | S | None | Developer experience, API discoverability |
| Sprint 6 | 2.6 Status Page | S | None | User trust, incident communication |
| Sprint 6 | 2.16 Changelog | S | None | User engagement, release communication |
| Sprint 7 | 2.11 Rate Limit Dashboard | S | None | Ops visibility |
| Sprint 7 | 2.25 Billing Dunning | S | 2.1 Jobs + 2.2 Email | Payment recovery automation |
| Later | 2.8, 2.12, 2.14-2.15, 2.17-2.18, 2.20-2.24 | Varies | | Scale, polish, enterprise |

Effort key: S = Small (1-2 days), M = Medium (3-5 days), L = Large (1-2 weeks)

Critical path: Event Bus (2.4) → Email Delivery (2.2) → Password Reset (2.5). These three should be the first items built, in that order.
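
The sprint ordering can be sanity-checked by deriving a build order from the Dependencies column of the matrix. A small sketch using a depth-first topological sort; the `DEPENDS_ON` map is transcribed from the matrix above and `buildOrder` is a hypothetical helper.

```typescript
// Section number -> sections it depends on, transcribed from the priority matrix.
const DEPENDS_ON: Record<string, string[]> = {
  '2.1': [],
  '2.4': [],
  '2.2': ['2.4'],
  '2.5': ['2.2'],
  '2.3': ['2.4'],
  '2.7': [],
  '2.10': [],
  '2.9': ['2.1'],
  '2.13': ['2.1'],
  '2.19': [],
  '2.6': [],
  '2.16': [],
  '2.11': [],
  '2.25': ['2.1', '2.2'],
};

// Depth-first topological sort: every component appears after all of its dependencies.
function buildOrder(deps: Record<string, string[]>): string[] {
  const order: string[] = [];
  const state = new Map<string, 'visiting' | 'done'>();
  const visit = (id: string): void => {
    if (state.get(id) === 'done') return;
    if (state.get(id) === 'visiting') throw new Error(`dependency cycle at ${id}`);
    state.set(id, 'visiting');
    for (const dep of deps[id] ?? []) visit(dep);
    state.set(id, 'done');
    order.push(id);
  };
  for (const id of Object.keys(deps)) visit(id);
  return order;
}
```

Any order this produces places 2.4 before 2.2 and 2.2 before 2.5, which matches the critical path stated above.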


4. New Cosmos Containers & Cost Impact

Each new component introduces Cosmos containers. Cosmos DB Serverless charges per RU consumed + storage, so idle containers cost only storage (~$0.25/GB/month).

| Component | New Containers | Partition Key | Est. TTL | Est. Daily RU |
| --- | --- | --- | --- | --- |
| 2.1 Jobs | job_definitions, job_runs | /productId, /productId:jobName | runs: 90d | ~50 RU (low volume) |
| 2.2 Email/Push | delivery_log, email_templates | /productId:channel:yyyyMM, /productId | log: 90d | ~200 RU |
| 2.3 Webhooks | webhook_subscriptions, webhook_deliveries | /productId, /subscriptionId:yyyyMM | deliveries: 30d | ~100 RU |
| 2.5 Password Reset | password_reset_tokens, email_verifications | /productId, /productId | 24h auto | ~10 RU |
| 2.6 Status | service_status, incidents | /productId, /productId | None | ~20 RU |
| 2.7 Sessions | sessions | /userId | 90d | ~500 RU (read-heavy) |
| 2.8 Migrations | migrations | /productId | None | ~5 RU (startup only) |
| 2.9 Exports | export_jobs | /productId | 30d | ~20 RU |
| 2.12 Experiments | experiments | /productId | None | ~50 RU |
| 2.13 Analytics | analytics_rollups | /productId:metric:period | None | ~300 RU (write-heavy during rollup) |
| 2.11 IP Rules | ip_rules | /productId | None (manual) | ~10 RU |
| 2.14 Feedback | feedback | /productId | None | ~50 RU |
| 2.16 Changelog | changelog | /productId | None | ~10 RU |
| 2.20 i18n | translations | /productId:locale | None | ~100 RU (read-heavy, cacheable) |
| 2.23 Retention | retention_policies | /productId | None | ~5 RU |

Total new containers: ~19 (across all phases)

Existing containers: 27 (defined in cosmos-init.ts: products, users, settings, devices, notification_prefs, audit_log, feature_flags, invitation_codes, referrals, subscriptions, payments, licenses, plans, usage_daily, api_tokens, tracker_items, comments, votes, themes, waitlist, memory_items, daily_briefs, reflections, brain_insights, telemetry_events, telemetry_error_clusters, telemetry_collection_policies). Note: the promos module uses the Stripe API directly and has no Cosmos container.

Cost impact: Minimal for the Serverless tier, since idle containers only consume storage. Active containers during job runs add burst RU.

Recommendation: Register all new containers in cosmos-init.ts alongside existing ones. Use TTL liberally for transient data (tokens, deliveries, job runs) to keep storage bounded.
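
Several partition keys in the table are composites such as /productId:channel:yyyyMM. A sketch of how such values might be built, assuming the composite is stored in a single synthetic property on each document; the function names are hypothetical, and month buckets are computed in UTC so a document never changes partition with the server's time zone.

```typescript
// "yyyyMM" bucket in UTC, e.g. 2026-02-17 -> "202602".
function monthBucket(date: Date): string {
  const yyyy = date.getUTCFullYear().toString();
  const mm = (date.getUTCMonth() + 1).toString().padStart(2, '0');
  return `${yyyy}${mm}`;
}

// Synthetic key for delivery_log (pk: /productId:channel:yyyyMM).
function deliveryLogPartitionKey(productId: string, channel: 'email' | 'push', sentAt: Date): string {
  return `${productId}:${channel}:${monthBucket(sentAt)}`;
}

// Synthetic key for webhook_deliveries (pk: /subscriptionId:yyyyMM).
function webhookDeliveryPartitionKey(subscriptionId: string, deliveredAt: Date): string {
  return `${subscriptionId}:${monthBucket(deliveredAt)}`;
}

// Cosmos DB TTL values are in seconds; 90 days for job_runs per the table above.
const JOB_RUNS_TTL_SECONDS = 90 * 24 * 60 * 60;
```

Monthly buckets keep each logical partition bounded in size, and combined with TTL they let old partitions empty out on their own.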


5. New Environment Variables

New components will require additional env vars. All should be added to .env.example files in both repos and documented.

| Component | Variable | Example | Required |
| --- | --- | --- | --- |
| 2.1 Jobs | JOB_RUNNER_ENABLED | true | No (default: true) |
| 2.1 Jobs | JOB_TICK_INTERVAL_MS | 60000 | No (default: 60s) |
| 2.2 Email | SENDGRID_API_KEY | SG.xxx | Yes (for email delivery) |
| 2.2 Email | EMAIL_FROM_ADDRESS | noreply@lysnrai.com | Yes |
| 2.2 Email | EMAIL_FROM_NAME | LysnrAI | No |
| 2.2 Push | APNS_KEY_ID | ABC123 | Yes (for iOS push) |
| 2.2 Push | APNS_TEAM_ID | 748N7QPX7J | Yes |
| 2.2 Push | APNS_KEY_PATH | ./certs/AuthKey.p8 | Yes |
| 2.2 Push | FCM_SERVICE_ACCOUNT_JSON | {...} | Yes (for Android push) |
| 2.5 Auth | PASSWORD_RESET_URL_BASE | https://app.lysnrai.com/reset | Yes |
| 2.5 Auth | EMAIL_VERIFY_URL_BASE | https://app.lysnrai.com/verify | Yes |
| 2.10 Maintenance | MAINTENANCE_MODE | off | No (default: off) |
| 2.10 Maintenance | MAINTENANCE_BYPASS_IPS | 10.0.0.1,10.0.0.2 | No |
| 2.3 Webhooks | WEBHOOK_DELIVERY_TIMEOUT_MS | 5000 | No (default: 5s) |
| 2.3 Webhooks | WEBHOOK_MAX_RETRIES | 3 | No (default: 3) |
| 2.7 Sessions | SESSION_TTL_DAYS | 90 | No (default: 90) |
| 2.7 Sessions | SESSION_CACHE_TTL_MS | 30000 | No (default: 30s) |
| 2.19 OpenAPI | SWAGGER_UI_ENABLED | true | No (default: true in dev) |

Secret management: SENDGRID_API_KEY, APNS_*, and FCM_* should be added to Azure Key Vault as lysnr-sendgrid-api-key, lysnr-apns-key-id, etc. Update LYSNR_SECRETS in @bytelyst/config to include them.
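
The optional variables and their defaults could be parsed in one place so that a malformed value fails at startup rather than at first use. A minimal sketch covering a subset of the table; `PlatformConfig`, `loadConfig`, and the two parsing helpers are hypothetical and are not the @bytelyst/config API.

```typescript
interface PlatformConfig {
  jobRunnerEnabled: boolean;
  jobTickIntervalMs: number;
  webhookDeliveryTimeoutMs: number;
  webhookMaxRetries: number;
  sessionTtlDays: number;
  sessionCacheTtlMs: number;
  maintenanceBypassIps: string[];
}

// Parses a non-negative integer, falling back to the default when unset or blank.
function intFromEnv(value: string | undefined, fallback: number, name: string): number {
  if (value === undefined || value.trim() === '') return fallback;
  const n = Number(value);
  if (!Number.isInteger(n) || n < 0) throw new Error(`${name} must be a non-negative integer`);
  return n;
}

function boolFromEnv(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined || value.trim() === '') return fallback;
  return value.trim().toLowerCase() === 'true';
}

// Defaults mirror the "Required" column of the table above.
function loadConfig(env: Record<string, string | undefined>): PlatformConfig {
  return {
    jobRunnerEnabled: boolFromEnv(env['JOB_RUNNER_ENABLED'], true),
    jobTickIntervalMs: intFromEnv(env['JOB_TICK_INTERVAL_MS'], 60000, 'JOB_TICK_INTERVAL_MS'),
    webhookDeliveryTimeoutMs: intFromEnv(env['WEBHOOK_DELIVERY_TIMEOUT_MS'], 5000, 'WEBHOOK_DELIVERY_TIMEOUT_MS'),
    webhookMaxRetries: intFromEnv(env['WEBHOOK_MAX_RETRIES'], 3, 'WEBHOOK_MAX_RETRIES'),
    sessionTtlDays: intFromEnv(env['SESSION_TTL_DAYS'], 90, 'SESSION_TTL_DAYS'),
    sessionCacheTtlMs: intFromEnv(env['SESSION_CACHE_TTL_MS'], 30000, 'SESSION_CACHE_TTL_MS'),
    maintenanceBypassIps: (env['MAINTENANCE_BYPASS_IPS'] ?? '')
      .split(',')
      .map((ip) => ip.trim())
      .filter((ip) => ip !== ''),
  };
}
```

In the service this would be called once at boot with `process.env`; taking the environment as a parameter keeps it easy to test.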


6. Quick Reference — Where Things Live

| Component | Repo | Path |
| --- | --- | --- |
| Platform-service modules | learning_ai_common_plat | services/platform-service/src/modules/ |
| Shared packages | learning_ai_common_plat | packages/ |
| Admin dashboard | learning_voice_ai_agent | admin-dashboard-web/ |
| User dashboard | learning_voice_ai_agent | user-dashboard-web/ |
| Tracker dashboard | learning_voice_ai_agent | tracker-dashboard-web/ |
| Docker Compose | both repos | docker-compose.yml |
| Monitoring | learning_ai_common_plat | services/monitoring/ |
| Design tokens | learning_ai_common_plat | packages/design-tokens/ |
| MindLyst native app | learning_multimodal_memory_agents | mindlyst-native/ (KMP + SwiftUI + Compose + Next.js) |
| MindLyst web | learning_multimodal_memory_agents | mindlyst-native/web/ |
| Existing webhooks | learning_ai_common_plat | services/platform-service/src/lib/webhooks.ts |
| Cosmos container defs | learning_ai_common_plat | services/platform-service/src/lib/cosmos-init.ts |
| Telemetry design doc | learning_ai_common_plat | docs/WINDSURF/CLIENT_TELEMETRY_DESIGN.md |
| Telemetry roadmap | learning_ai_common_plat | docs/WINDSURF/TELEMETRY_ROADMAP.md |
| This document | learning_ai_common_plat | docs/WINDSURF/PLATFORM_COMPONENTS_ROADMAP.md |

Appendix A: Risks & Open Questions

| # | Topic | Risk / Question | Mitigation |
| --- | --- | --- | --- |
| 1 | Leader election for jobs | In-process tick loop with Cosmos lease: what happens during deploys? Two instances may briefly both hold leases. | Cosmos lease has a built-in TTL. Use 30s lease with 10s renewal. During deploy overlap, the old instance's lease expires before the new one acquires. Jobs must be idempotent. |
| 2 | Email deliverability | SendGrid requires domain verification (SPF/DKIM/DMARC). Without it, emails land in spam. | Set up lysnrai.com domain authentication in SendGrid before shipping §2.2. Budget 1-2 days for DNS propagation. |
| 3 | Session validation latency | Checking Cosmos on every request for session revocation adds ~5-10ms per request. | In-memory cache with 30s TTL (§2.7). Revocation is eventually consistent, an acceptable trade-off for most apps. Document the 30s window. |
| 4 | Cosmos container proliferation | 27 existing + ~19 new ≈ 46 containers. Serverless tier has no per-container cost, but management complexity grows. | Group related containers by module. Document all containers in cosmos-init.ts. Consider container-per-module naming convention. |
| 5 | Event bus ordering guarantees | In-memory EventEmitter has no ordering guarantees across handlers. If audit must record before webhook fires, ordering matters. | Phase 1: Document that handlers run concurrently with no ordering. If ordering is needed, use handler priority weights or sequential mode. |
| 6 | Push notification certificates | APNs requires yearly certificate renewal. If it expires, all iOS push silently stops. | Add apns-cert-expiry-check to scheduled jobs (§2.1). Alert admin 30 days before expiry. |
| 7 | Webhook abuse | External subscribers could register slow endpoints that back up the delivery queue. | Per-subscription timeout (5s default), circuit breaker after 10 consecutive failures, auto-disable. |
| 8 | Migration rollback | Cosmos is schemaless, and some migrations (e.g., partition key changes) are irreversible. | Mark migrations as reversible: true/false. Require manual approval for irreversible migrations. Always back up before running. |
| 9 | MindLyst parity | MindLyst web uses Cosmos directly (in-memory fallback). Shared components (email, sessions, webhooks) must work for MindLyst too, not just LysnrAI. | All new modules use productId for multi-product isolation. MindLyst can consume the same platform-service APIs. |
| 10 | Priority conflicts | Sprint plan assumes available engineering bandwidth. If telemetry or mobile work takes priority, these sprints slip. | Treat sprint assignments as relative ordering, not calendar commitments. Re-evaluate after each sprint. |
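
The mitigation for risk 7 (circuit breaker after 10 consecutive failures, auto-disable) can be sketched as a small per-subscription counter. The class and method names are hypothetical; persistence of the disabled flag and the 5s delivery timeout are left out.

```typescript
// Tracks consecutive delivery failures for one webhook subscription.
class SubscriptionBreaker {
  private consecutiveFailures = 0;
  private disabled = false;
  private readonly threshold: number;

  constructor(threshold: number = 10) {
    this.threshold = threshold;
  }

  // Any successful delivery resets the streak.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  // Timeouts and non-2xx responses both count as failures.
  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.threshold) this.disabled = true;
  }

  // The delivery worker skips disabled subscriptions.
  isDisabled(): boolean {
    return this.disabled;
  }

  // Manual re-enable, e.g. from the admin UI after the subscriber fixes their endpoint.
  reenable(): void {
    this.disabled = false;
    this.consecutiveFailures = 0;
  }
}
```

Counting only consecutive failures means an endpoint that fails intermittently stays enabled, while one that is down outright is disabled after a bounded number of wasted attempts.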

Appendix B: Component Dependency Graph

                    ┌─────────────────────┐
                    │   Event Bus (2.4)   │
                    └─────────┬───────────┘
                              │ emits to subscribers
        ┌───────────┼───────────┼───────────┐
        │           │           │           │
        ▼           ▼           ▼           ▼
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ Email/Push│ │ Webhook   │ │ Audit Log │ │ Analytics │
│ (2.2)     │ │ (2.3)     │ │ (existing)│ │ (2.13)    │
└─────┬─────┘ └───────────┘ └───────────┘ └───────────┘
      │
      │ sends
      ▼
┌───────────┐
│ Password  │
│ Reset(2.5)│
└───────────┘

┌───────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ Scheduled     │──▶│ Analytics       │   │ Blob Storage    │
│ Jobs (2.1)    │   │ Rollups (2.13)  │   │ (existing)      │
└───────┬───────┘   └─────────────────┘   └────────┬────────┘
        │                                          │
        │ triggers on schedule                     ▲ writes exports
        ▼                                          │
┌───────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ Trial Expiry  │   │ Usage Reset     │   │ Data Export     │
│ (2.1 job)     │   │ (2.1 job)       │   │ (2.9)           │
└───────────────┘   └─────────────────┘   └─────────────────┘

┌───────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ Billing       │──▶│ Email/Push      │   │ Retention       │
│ Dunning(2.25) │   │ Delivery (2.2)  │   │ Cleanup (2.23)  │
└───────────────┘   └─────────────────┘   └─────────────────┘

Appendix C: Review Findings

Systematic review performed 2026-02-17. All issues below have been fixed inline.

| # | Severity | Section | Finding | Fix |
| --- | --- | --- | --- | --- |
| 1 | Bug | §1.3 | Test count stale: said "158+ tests"; actual count is 621 (verified via grep -c 'it(' *.test.ts). | Updated to 621. |
| 2 | Bug | §1.1 | Endpoint column inconsistent: some modules said "CRUD" (vague, could be 4-8 routes), others had exact counts. | Replaced all "CRUD" with actual route counts. |
| 3 | Bug | §2.5 | Said "console-logged URLs for dev/testing", which violates project rule: never console.log in production code. | Changed to req.log.info. |
| 4 | Bug | §2.12 | ExperimentDoc.targetingRules: {} is a meaningless empty object type. | Changed to FlagTargetingRules (reuse from flags module). |
| 5 | Bug | §2.3 | Webhook event user.deleted source said auth.delete; no such endpoint name. Actual route is DELETE /auth/users/:id (admin action). | Fixed source column. |
| 6 | Bug | §4 | email_verifications container (from §2.5) missing from Cosmos table. Only password_reset_tokens was listed. | Added email_verifications to §2.5 row. |
| 7 | Bug | §4 | Existing container count said "~25+"; actual is 27 (counted from cosmos-init.ts; promos uses Stripe API directly, no Cosmos container). | Updated to 27 with full container list. |
| 8 | Bug | §4 | Total new containers said "~17"; after adding email_verifications and ip_rules, count is 19. | Updated. |
| 9 | Gap | §2.2 | No clarity on email template storage strategy. renderer.ts mentioned but not whether templates are Cosmos-stored or file-based. | Clarified: repository.ts now references delivery_log + email_templates containers. |
| 10 | Gap | §2.4 | No migration strategy from existing lib/webhooks.ts to new event bus pattern. | Added "Migration from existing lib/webhooks.ts" subsection with 3-phase plan. |
| 11 | Gap | §2.10 | Maintenance mode proposed extending settings module but didn't clarify storage location. Missing from §4 Cosmos table. | Added: stored as single document per product in existing settings container (no new container needed). |
| 12 | Gap | §2.11 | IP rules need persistence but no container was mentioned. Missing from §4 table. | Added ip_rules container (pk: /productId) to both §2.11 and §4 table. |
| 13 | Gap | §2.9 | Data Export didn't mention blob module dependency (exports written to blob storage). | Added explicit dependency note on blob module and jobs module for cleanup. |
| 14 | Gap | §5 | Missing env vars for webhooks (timeout, retries) and sessions (TTL, cache TTL). | Added 4 new env vars: WEBHOOK_DELIVERY_TIMEOUT_MS, WEBHOOK_MAX_RETRIES, SESSION_TTL_DAYS, SESSION_CACHE_TTL_MS. |
| 15 | Gap | §6 | Quick Reference missing MindLyst repo (learning_multimodal_memory_agents). Doc scope says "ByteLyst platform" which includes MindLyst. | Added MindLyst native app and web entries. Also added cosmos-init.ts path. |
| 16 | Gap | Appendix | Dependency graph incomplete: missing Jobs → Data Export connection, missing Blob → Data Export dependency, downstream jobs not labeled with section numbers. | Rewrote graph with all connections and section labels. |
| 17 | Gap | Overall | No "Risks & Open Questions" section; design docs should call out unknowns. | Added Appendix A with 10 risk items and mitigations. |
| 18 | Gap | TOC | Table of Contents didn't include Appendix sections. | Added Appendix A, B, C to TOC. |
| 19 | Gap | §2.5 | Password reset cross-referenced "§2.6" for sessions but sessions was renumbered to §2.7 in previous edit pass. | Fixed to §2.7 (caught in prior pass). |
| 20 | Gap | §1.5 | Infrastructure table was missing Swagger/OpenAPI (partially wired) and Prometheus metrics (partially enabled). | Added in prior pass; verified still present. |

This document is a living brainstorm. Items will be promoted to dedicated design docs (like CLIENT_TELEMETRY_DESIGN.md) as they move into implementation.