saravanakumardb1 80a4459f81 docs: update documentation

2026-02-17 10:49:14 -08:00

Platform Components Roadmap — What's Built, What's Missing, What's Next

Status: Living document — brainstorm + gap analysis
Last updated: 2026-02-17
Scope: All infrastructure components relevant to admin, DevOps, and product operations across the ByteLyst platform.
Repos: learning_ai_common_plat (platform-service, packages) · learning_voice_ai_agent (dashboards, clients)

Current Inventory
Gap Analysis — Missing Components
Implementation Priority Matrix
New Cosmos Containers & Cost Impact
New Environment Variables
Quick Reference — Where Things Live

Appendix A: Risks & Open Questions
Appendix B: Component Dependency Graph
Appendix C: Review Findings

1. Current Inventory

1.1 Platform-Service Modules (25 modules)

Category	Module	Endpoints	Description
Identity	`auth`	11 routes	Login, register, refresh, SSO, profile, admin user CRUD
Identity	`tokens`	5 routes	API token management (CRUD + validate)
Identity	`licenses`	6 routes	License key generation, activation, device binding, validate
Billing	`subscriptions`	5 routes	Plan management, trial tracking, period management
Billing	`stripe`	2 routes	Inbound Stripe webhook + portal session
Billing	`plans`	4 routes	Plan definitions (free, pro, enterprise)
Billing	`usage`	4 routes	Usage tracking and quota enforcement
Billing	`promos`	5 routes	Promo code creation, validation, redemption
Growth	`invitations`	5 routes	Invitation code generation, redemption, tracking
Growth	`referrals`	5 routes	Referral link tracking, status transitions
Growth	`waitlist`	12 routes	Pre-launch signups, position tracking, admin batch invite, CSV export
Growth	`public`	5 routes	Public roadmap, community voting, feature submissions
Content	`items`	5 routes	Tracker items (bugs, features, tasks)
Content	`comments`	4 routes	Threaded comments on items
Content	`votes`	3 routes	User votes on items and comments
Content	`memory`	5 routes	Memory items — create, reassign, patch, delete
Ops	`audit`	Query	Audit log recording and admin queries
Ops	`flags`	5 routes	Feature flags with FNV-1a deterministic rollout
Ops	`telemetry`	9 routes	Client event ingestion, error clustering, collection policies, GDPR erasure
Ops	`notifications`	5 routes	Device registration, notification preferences
Ops	`settings`	6 routes	User/device settings, kill switch
Ops	`ratelimit`	4 routes	Rate limit checking, config management
Ops	`themes`	7 routes	Platform theming (iOS, Android, Desktop)
Ops	`blob`	5 routes	Azure Blob Storage SAS tokens, list, delete, info
Registry	`products`	4 routes	Multi-product registry with full lifecycle (draft → pre_launch → beta → active → sunset → disabled)

1.2 Shared Packages (13 packages)

Package	Purpose
`@bytelyst/errors`	Typed HTTP errors (400–429)
`@bytelyst/cosmos`	Cosmos DB client singleton + container registry
`@bytelyst/config`	Zod env loader, product identity, AKV resolver
`@bytelyst/auth`	JWT utilities, auth middleware, password hashing
`@bytelyst/api-client`	Fetch wrapper with auth token injection
`@bytelyst/fastify-core`	`createServiceApp()` factory + `startService()`
`@bytelyst/react-auth`	React auth context factory
`@bytelyst/logger`	Structured logging (pino-based)
`@bytelyst/testing`	Shared test mocks, Fastify inject helpers
`@bytelyst/blob`	Azure Blob Storage client + SAS helpers
`@bytelyst/extraction`	Extraction client + shared types
`@bytelyst/monitoring`	Health-check utilities
`@bytelyst/design-tokens`	Cross-platform token generator (JSON → CSS/TS/Kotlin/Swift)

1.3 Services

Service	Port	Description
platform-service	4003	Consolidated Fastify service (25 modules, 621 tests)
extraction-service	4005	LangExtract text extraction + Python sidecar
monitoring	4004	Health-check aggregator (all services)

1.4 Dashboards

Dashboard	Port	Pages
admin-dashboard-web	3001	~25 pages — users, billing, flags, ops, telemetry, secrets, etc.
user-dashboard-web	3002	User portal — subscription, usage, settings
tracker-dashboard-web	3003	Public roadmap, issue tracker, community voting

1.5 Infrastructure Already In Place

Component	Status	Notes
Health checks	✅	Per-service `/health` + aggregated monitoring script
Structured logging	✅	Pino (Fastify) + structlog (Python)
Log aggregation	✅	Loki + Grafana (Docker Compose)
Reverse proxy	✅	Traefik (Docker Compose)
Secret management	✅	Azure Key Vault + admin CRUD UI at `/ops/secrets`
Feature flags	✅	FNV-1a hash, percentage rollout, admin UI
Client telemetry	✅	All platforms instrumented, admin Client Logs page
Rate limiting	✅	In-memory sliding window + configurable rules per product
Outbound webhooks	⚠️ Partial	Fire-and-forget POST for 3 events (`lib/webhooks.ts`); no subscription model, no retry, no HMAC signing
Kill switch	✅	Per-product, checked by all clients via `/settings/kill-switch`
Audit logging	✅	Records admin actions, queryable from admin dashboard
Blob storage	✅	6 containers (audio, transcripts, attachments, avatars, releases, backups), SAS tokens, admin endpoints
Swagger / OpenAPI	⚠️ Partial	`createServiceApp()` passes `swagger` config; Fastify plugin wired but Zod schemas not fully connected to route definitions via type provider
Prometheus metrics	⚠️ Partial	`metrics: true` in `createServiceApp()` — basic request metrics exposed; no custom business metrics, no Grafana dashboards for them
Product registry	✅	Multi-product with full status lifecycle (draft → pre_launch → beta → active → sunset → disabled), prelaunch config, custom fields
Admin doc browser	✅	`/docs` page with markdown viewer, search, and AI chat — browses repo documentation

2. Gap Analysis — Missing Components

P0 — Foundational

These are blocking features that nearly every production app needs. Without them, critical operational workflows are manual or impossible.

2.1 Scheduled Jobs / Background Task Runner

Why: No way to run recurring work today. Trial expirations, subscription renewals, usage quota resets, stale data cleanup, digest emails, and report generation all require a scheduler.

Current state: Zero. All logic is request-driven (HTTP request → response).

Proposed design:

platform-service/src/modules/jobs/
├── types.ts         — JobDefinition, JobRun, JobSchedule schemas
├── registry.ts      — Job registry (register named jobs with cron expressions)
├── runner.ts        — Tick loop: evaluate cron, run due jobs, record outcomes
├── repository.ts    — Cosmos: job_definitions, job_runs containers
└── routes.ts        — Admin: list jobs, trigger manually, view run history, pause/resume
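
The registry and tick loop described above could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: it uses a fixed `intervalMs` per job instead of parsing cron expressions, handlers are synchronous, and the field names on `JobDefinition` and `JobRun` are assumptions (the roadmap only names the schemas).

```typescript
// Hypothetical shapes; only the schema names come from the design above.
interface JobDefinition {
  name: string;
  intervalMs: number; // simplification: fixed interval instead of a cron expression
  enabled: boolean;
  lastRunAt?: number; // epoch ms of the most recent run
  handler: () => void;
}

interface JobRun {
  jobName: string;
  startedAt: number;
  status: "succeeded" | "failed";
  error?: string;
}

class JobRegistry {
  private jobs = new Map<string, JobDefinition>();

  register(job: JobDefinition): void {
    if (this.jobs.has(job.name)) throw new Error(`duplicate job: ${job.name}`);
    this.jobs.set(job.name, job);
  }

  // A job is due when it is enabled and has never run, or its interval has elapsed.
  due(now: number): JobDefinition[] {
    const result: JobDefinition[] = [];
    this.jobs.forEach((job) => {
      const elapsed = job.lastRunAt === undefined ? Infinity : now - job.lastRunAt;
      if (job.enabled && elapsed >= job.intervalMs) result.push(job);
    });
    return result;
  }

  // One tick: run every due job and record the outcome; one failure never stops the rest.
  tick(now: number): JobRun[] {
    return this.due(now).map((job) => {
      job.lastRunAt = now;
      const run: JobRun = { jobName: job.name, startedAt: now, status: "succeeded" };
      try {
        job.handler();
      } catch (err) {
        run.status = "failed";
        run.error = err instanceof Error ? err.message : String(err);
      }
      return run;
    });
  }
}
```

In a real runner, `tick(Date.now())` would be called from a timer and each `JobRun` persisted to the `job_runs` container.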

Built-in jobs to ship on day 1:

Job	Schedule	Description
`trial-expiration-check`	Every hour	Find subscriptions with `status=trialing` past `currentPeriodEnd`, transition to `expired` or `active`
`usage-quota-reset`	Daily at midnight UTC	Reset daily/monthly counters in `usage_daily` container
`stale-session-cleanup`	Every 6 hours	Remove expired refresh tokens and inactive sessions
`telemetry-ttl-sweep`	Daily at 3am UTC	Delete telemetry events past retention TTL (Cosmos TTL is best-effort)
`waitlist-reminder`	Weekly	Identify stale waitlist entries, mark for follow-up
`license-expiry-check`	Daily	Warn users whose licenses expire within 7 days

Options for the runner:

In-process tick loop (simplest): setInterval in platform-service, with leader election via Cosmos lease
Azure Functions timer triggers (serverless): Lower cost, built-in cron, but adds deployment complexity
BullMQ + Redis (heavy): Best for high-throughput, but adds a Redis dependency

Recommendation: Start with in-process tick loop + Cosmos lease for leader election (avoids Redis). Migrate to Azure Functions if job volume grows.
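
To make the leader-election idea concrete, here is a small sketch of lease acquisition. It assumes a single lease record updated with a compare-and-swap write; the `LeaseStore` class is an in-memory stand-in for that conditional write (in Cosmos this would typically be an ETag-conditioned replace), and every name here is hypothetical.

```typescript
// Hypothetical lease record; `version` stands in for the document ETag.
interface Lease {
  holderId: string;
  expiresAt: number; // epoch ms
  version: number;
}

// In-memory stand-in for a store that supports compare-and-swap writes.
class LeaseStore {
  private lease: Lease | undefined;

  read(): Lease | undefined {
    return this.lease ? { ...this.lease } : undefined;
  }

  // Succeeds only if the stored version still matches what the caller read.
  compareAndSwap(expectedVersion: number | undefined, next: Lease): boolean {
    const currentVersion = this.lease ? this.lease.version : undefined;
    if (currentVersion !== expectedVersion) return false;
    this.lease = { ...next };
    return true;
  }
}

// Returns true when `holderId` holds the lease after the call.
// The same call both acquires a free or expired lease and renews one already held.
function tryAcquireLease(store: LeaseStore, holderId: string, now: number, ttlMs: number): boolean {
  const current = store.read();
  const heldByOther =
    current !== undefined && current.holderId !== holderId && current.expiresAt > now;
  if (heldByOther) return false;
  const next: Lease = {
    holderId,
    expiresAt: now + ttlMs,
    version: (current ? current.version : 0) + 1,
  };
  return store.compareAndSwap(current ? current.version : undefined, next);
}
```

Only the instance for which `tryAcquireLease` returns true would run the tick loop; the others would retry on their next tick, so a crashed leader is replaced once its lease expires.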

Admin UI:

/ops/jobs page: list all registered jobs, last run status, next scheduled run
Manual trigger button per job
Run history table with duration, outcome, error details
Pause/resume toggle per job

Cosmos containers:

job_definitions (pk: /productId) — name, cron, enabled, lastRunAt, nextRunAt
job_runs (pk: /productId:jobName) — runId, startedAt, completedAt, status, error, metrics

2.2 Transactional Email & Push Delivery

Why: The notifications module manages device registration and preferences, but has no delivery mechanism. Notifications are database records with no way to reach users.

Current state: Device registration + preference management only. No email, no push, no SMS.

Proposed design:

platform-service/src/modules/delivery/
├── types.ts         — DeliveryRequest, DeliveryLog, ChannelConfig schemas
├── channels/
│   ├── email.ts     — SendGrid/Postmark adapter
│   ├── push-apns.ts — Apple Push Notification Service
│   ├── push-fcm.ts  — Firebase Cloud Messaging
│   └── sms.ts       — Twilio/Azure Communication Services (future)
├── renderer.ts      — Template rendering (Handlebars for email bodies)
├── repository.ts    — delivery_log + email_templates containers
├── dispatcher.ts    — Route delivery request to correct channel(s) based on prefs
└── routes.ts        — Admin: send test, view delivery log, manage templates
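
As an illustration of what `dispatcher.ts` might do, the sketch below resolves the concrete channels for one delivery request from the user's preferences. The `UserPrefs` and `DeliveryRequest` shapes are assumptions; the notifications module's real preference schema is not shown in this document.

```typescript
type Channel = "email" | "push-apns" | "push-fcm";

// Hypothetical preference shape, for illustration only.
interface UserPrefs {
  emailEnabled: boolean;
  pushEnabled: boolean;
  devices: { platform: "ios" | "android"; token: string }[];
}

interface DeliveryRequest {
  userId: string;
  template: string;
  channels: ("email" | "push")[]; // channels the caller would like to use
}

// Resolve concrete channels for one request, honoring the user's opt-outs
// and only selecting push providers for which a device is registered.
function resolveChannels(req: DeliveryRequest, prefs: UserPrefs): Channel[] {
  const out: Channel[] = [];
  if (req.channels.indexOf("email") !== -1 && prefs.emailEnabled) out.push("email");
  if (req.channels.indexOf("push") !== -1 && prefs.pushEnabled) {
    if (prefs.devices.some((d) => d.platform === "ios")) out.push("push-apns");
    if (prefs.devices.some((d) => d.platform === "android")) out.push("push-fcm");
  }
  return out;
}
```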

Email templates to ship on day 1:

Template	Trigger	Description
`welcome`	`auth.register`	Welcome email with getting-started guide
`trial-expiring`	`jobs.trial-expiration-check` (7d warning)	"Your trial ends in 7 days"
`trial-expired`	`jobs.trial-expiration-check`	"Your trial has ended — upgrade to continue"
`password-reset`	Future: `/auth/forgot-password`	One-time reset link
`invitation`	`invitations.create`	"You've been invited to join"
`waitlist-accepted`	`waitlist.invite`	"You're in! Here's your access"
`payment-failed`	`stripe.invoice.payment_failed`	"We couldn't charge your card"
`license-expiring`	`jobs.license-expiry-check`	"Your license expires in 7 days"

Push notification types:

Type	Channel	Description
`dictation_reminder`	APNs + FCM	"Haven't dictated today — keep your streak!"
`feature_announcement`	APNs + FCM	Admin-triggered announcement
`subscription_change`	APNs + FCM	Plan upgraded/downgraded/expired

Cosmos container:

delivery_log (pk: /productId:channel:yyyyMM) — id, userId, channel, template, status (sent/failed/bounced), sentAt, error
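
The composite partition key above buckets log entries by product, channel, and calendar month. A helper along these lines could build it; the function name is hypothetical, and using the UTC month is an assumption.

```typescript
// Builds the synthetic partition key value "productId:channel:yyyyMM" (UTC month).
function deliveryLogPartitionKey(productId: string, channel: string, sentAt: Date): string {
  const year = sentAt.getUTCFullYear();
  const monthNumber = sentAt.getUTCMonth() + 1; // getUTCMonth() is zero-based
  const month = monthNumber < 10 ? `0${monthNumber}` : String(monthNumber);
  return `${productId}:${channel}:${year}${month}`;
}
```

Bucketing by month keeps each logical partition bounded in size and lets date-range filters in the delivery log target a small, known set of partitions.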

Admin UI:

/ops/delivery page: delivery log with filters (channel, status, template, date range)
Template management: list, preview, edit (future: visual editor)
"Send test" button for each template
Delivery stats: sent/failed/bounced/opened (with SendGrid webhook integration)

2.3 Outbound Webhook Subscriptions

Why: Current webhooks.ts is fire-and-forget to env-var URLs with no retry, no signing, no subscriber management. External integrations (Zapier, Slack, custom) need a proper webhook subscription system.

Current state: 3 hardcoded webhook dispatchers (invitation redeemed, referral status changed, waitlist joined). No retry. No HMAC signing. No subscription management.

Proposed design:

platform-service/src/modules/webhooks/
├── types.ts         — WebhookSubscription, WebhookDelivery, WebhookEvent schemas
├── repository.ts    — Cosmos: webhook_subscriptions, webhook_deliveries containers
├── dispatcher.ts    — Match event → subscriptions, queue delivery, HMAC-SHA256 sign
├── delivery.ts      — HTTP POST with exponential backoff retry (3 attempts)
└── routes.ts        — Admin CRUD for subscriptions + delivery log

Event catalog (subscribe to any combination):

Event	Payload	Source
`user.created`	`{ userId, email, plan }`	`auth.register`, `auth.sso`
`user.deleted`	`{ userId }`	Admin: `DELETE /auth/users/:id`
`subscription.created`	`{ subscriptionId, userId, plan, status }`	Registration hook
`subscription.changed`	`{ subscriptionId, oldPlan, newPlan, status }`	Stripe webhook
`subscription.canceled`	`{ subscriptionId, userId, reason }`	User action / Stripe
`payment.succeeded`	`{ invoiceId, amount, userId }`	Stripe webhook
`payment.failed`	`{ invoiceId, amount, userId, retryCount }`	Stripe webhook
`invitation.redeemed`	`{ invitationId, userId }`	Invitation module
`referral.completed`	`{ referralId, referrerId, referredId }`	Referral module
`waitlist.joined`	`{ email, position }`	Waitlist module
`flag.toggled`	`{ flagId, enabled, percentage }`	Flags module
`license.activated`	`{ licenseId, userId, deviceId }`	License module
`license.expired`	`{ licenseId, userId }`	Jobs: license-expiry-check

Security:

Every delivery signed with X-Webhook-Signature: sha256=<HMAC> using per-subscription secret
Subscription secret generated at creation time, displayed once, rotatable
Replay protection: every delivery carries an X-Webhook-Timestamp header; receivers should reject deliveries whose timestamp is more than 5 minutes old
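
A minimal sketch of the signing and verification steps, using Node's built-in `crypto` module. Signing the string `timestamp.body` (so that the timestamp is covered by the signature) is an assumption; the roadmap does not specify the exact signed payload, and the function names are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const TOLERANCE_MS = 5 * 60 * 1000; // the 5-minute replay window

// Sender side: value for the X-Webhook-Signature header.
function signWebhook(secret: string, timestampMs: number, body: string): string {
  const digest = createHmac("sha256", secret)
    .update(`${timestampMs}.${body}`) // assumption: timestamp is part of the signed payload
    .digest("hex");
  return `sha256=${digest}`;
}

// Receiver side: reject stale deliveries, then compare signatures in constant time.
function verifyWebhook(
  secret: string,
  timestampMs: number,
  body: string,
  signature: string,
  nowMs: number,
): boolean {
  if (Math.abs(nowMs - timestampMs) > TOLERANCE_MS) return false;
  const expected = Buffer.from(signWebhook(secret, timestampMs, body));
  const received = Buffer.from(signature);
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```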

Retry policy:

3 attempts with exponential backoff: 10s → 60s → 300s
After 3 failures: mark subscription as failing, admin notification
After 10 consecutive failures: auto-disable subscription
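
The retry policy can be expressed as two small pure functions. This sketch reads the schedule as "wait 10s, 60s, then 300s before successive retries" and treats the failure counter as consecutive failed deliveries; both readings are assumptions, as are the type and function names.

```typescript
const RETRY_DELAYS_MS = [10000, 60000, 300000];
const FAILING_THRESHOLD = 3;
const DISABLE_THRESHOLD = 10;

interface SubscriptionHealth {
  consecutiveFailures: number;
  state: "healthy" | "failing" | "disabled";
}

// Delay before the next retry, given how many attempts have failed so far.
// Returns null once the schedule is exhausted.
function nextRetryDelayMs(failedAttempts: number): number | null {
  const delay = RETRY_DELAYS_MS[failedAttempts - 1];
  return delay === undefined ? null : delay;
}

// Update subscription health after a delivery finally succeeds or fails.
function recordOutcome(health: SubscriptionHealth, delivered: boolean): SubscriptionHealth {
  // Assumption: a disabled subscription stays disabled until an admin re-enables it.
  if (health.state === "disabled") return health;
  if (delivered) return { consecutiveFailures: 0, state: "healthy" };
  const consecutiveFailures = health.consecutiveFailures + 1;
  const state =
    consecutiveFailures >= DISABLE_THRESHOLD
      ? "disabled"
      : consecutiveFailures >= FAILING_THRESHOLD
        ? "failing"
        : "healthy";
  return { consecutiveFailures, state };
}
```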

Admin UI:

/ops/webhooks page: list subscriptions, create/edit/delete, test delivery
Delivery log: status (success/failed/retrying), response code, duration, payload preview
Per-subscription health indicator (green/yellow/red based on recent success rate)

Cosmos containers:

webhook_subscriptions (pk: /productId) — id, url, secret, events[], enabled, failureCount, lastDeliveryAt
webhook_deliveries (pk: /subscriptionId:yyyyMM) — id, event, status, attempts[], responseCode, duration

2.4 Async Event Bus / Internal Pub-Sub

Why: Today everything is synchronous request-response. As the platform grows, many operations should be fire-and-forget: audit log writes, webhook delivery, email sending, telemetry cluster updates, usage tracking. Without decoupling, any slow downstream operation blocks the API response.

Current state: Some operations are already fire-and-forget, but with unhandled promise rejections (e.g., telemetry cluster updates). No formal event bus.

Proposed design:

packages/events/
├── src/
│   ├── index.ts     — EventBus class, typed event definitions
│   ├── types.ts     — PlatformEvent union type, EventHandler interface
│   └── memory.ts    — In-memory implementation (default)

Event flow:

API route handler
  → bus.emit('user.created', { userId, email, plan })
    → [handler] audit.record()
    → [handler] webhook.dispatch()
    → [handler] email.sendWelcome()
    → [handler] analytics.track()

Implementation options:

Phase 1: In-memory EventEmitter wrapper with typed events (zero dependencies)
Phase 2: Azure Service Bus adapter for cross-service events
Phase 3: Azure Event Grid for external consumer webhooks

Typed event definitions (Zod):

import { z } from 'zod';

// Maps each event name to the Zod schema that validates its payload.
const PlatformEvents = {
  'user.created': z.object({ userId: z.string(), email: z.string(), plan: z.string() }),
  'user.deleted': z.object({ userId: z.string() }),
  'subscription.changed': z.object({
    subscriptionId: z.string(),
    oldPlan: z.string(),
    newPlan: z.string(),
  }),
  'payment.failed': z.object({ invoiceId: z.string(), userId: z.string() }),
  // ... all events from webhook catalog
} as const;
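
A Phase 1 in-memory bus could be as small as the sketch below. It is dependency-free, so payload types are declared as plain TypeScript interfaces here rather than inferred from the Zod schemas; the class and its `onError` callback are hypothetical. The point it illustrates is handler isolation: a throwing or rejecting subscriber is reported instead of becoming an unhandled rejection or blocking the other subscribers.

```typescript
// Payload types mirroring two catalog entries; a real version could infer these from Zod.
interface PlatformEventMap {
  "user.created": { userId: string; email: string; plan: string };
  "user.deleted": { userId: string };
}

type Handler<T> = (payload: T) => void | Promise<void>;

class EventBus<Events> {
  // Handlers are stored untyped internally; on() and emit() enforce the payload types.
  private handlers = new Map<string, Handler<any>[]>();

  constructor(private onError: (event: string, err: unknown) => void = () => {}) {}

  on<K extends keyof Events & string>(event: K, handler: Handler<Events[K]>): void {
    const list = this.handlers.get(event) || [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  // Fire-and-forget: emit() returns immediately and never throws because of a handler.
  emit<K extends keyof Events & string>(event: K, payload: Events[K]): void {
    const list = this.handlers.get(event) || [];
    list.forEach((handler) => {
      try {
        // Promise.resolve covers both sync and async handlers; rejections are reported.
        Promise.resolve(handler(payload)).catch((err: unknown) => this.onError(event, err));
      } catch (err) {
        this.onError(event, err); // synchronous throw from a handler
      }
    });
  }
}
```

With this shape, the existing `dispatch*` functions from lib/webhooks.ts could be registered with `bus.on(...)` unchanged during the first migration phase.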

Migration from existing lib/webhooks.ts:

Existing dispatchInvitationRedeemed(), dispatchReferralStatusChanged(), dispatchWaitlistJoined() become event bus subscribers
Phase 1: Register existing webhooks.ts functions as handlers on the bus
Phase 2: Replace inline dispatch calls in routes with bus.emit()
Phase 3: Remove lib/webhooks.ts once all callers migrated

Benefits:

Audit logging becomes a subscriber, not inline code
Webhook delivery becomes a subscriber, not inline code
Email sending becomes a subscriber, not inline code
New features can subscribe to events without modifying existing modules

2.5 Missing Auth Flows — Password Reset & Email Verification

Why: The auth module has login, register, SSO, and refresh — but no password reset and no email verification. These are table-stakes for any production auth system.

Current state: If a user forgets their password, there is no recovery path. Registration accepts any email without verification.

Proposed additions to auth module:

Password reset flow:

POST /auth/forgot-password — accepts { email, productId }, generates a time-limited reset token (UUID), stores hash in password_reset_tokens container, sends email with reset link (via delivery module §2.2)
POST /auth/reset-password — accepts { token, newPassword }, validates token, updates passwordHash, invalidates token, optionally revokes all sessions (§2.7)

Email verification flow:

On register: generate verification token, store in email_verifications container, send email
POST /auth/verify-email — accepts { token }, marks user email as verified
POST /auth/resend-verification — rate-limited, re-sends verification email
Add emailVerified: boolean field to UserDoc

Reset token document:

interface PasswordResetToken {
  id: string; // UUID
  productId: string;
  userId: string;
  tokenHash: string; // SHA-256 hash of the token (raw token sent via email)
  expiresAt: string; // 1 hour from creation
  usedAt?: string;
  createdAt: string;
}

Security considerations:

Store hash of token, not raw token (same pattern as API tokens)
Tokens expire in 1 hour
Rate limit: 3 reset requests per email per hour
After successful reset, invalidate all existing sessions
Log all reset attempts to audit
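
The "store the hash, email the raw token" pattern could look roughly like this, using Node's built-in `crypto` module. The function names are hypothetical, and persistence, rate limiting, and audit logging are left out.

```typescript
import { createHash, randomUUID } from "node:crypto";

interface PasswordResetToken {
  id: string;
  productId: string;
  userId: string;
  tokenHash: string; // SHA-256 hash of the raw token
  expiresAt: string; // 1 hour from creation
  usedAt?: string;
  createdAt: string;
}

const RESET_TTL_MS = 60 * 60 * 1000;

function hashToken(rawToken: string): string {
  return createHash("sha256").update(rawToken).digest("hex");
}

// Returns the raw token (to email to the user) and the document to store.
// The raw token itself is never persisted.
function issueResetToken(productId: string, userId: string, now: Date) {
  const rawToken = randomUUID();
  const doc: PasswordResetToken = {
    id: randomUUID(),
    productId,
    userId,
    tokenHash: hashToken(rawToken),
    expiresAt: new Date(now.getTime() + RESET_TTL_MS).toISOString(),
    createdAt: now.toISOString(),
  };
  return { rawToken, doc };
}

// A token is usable when it matches the stored hash, is unexpired, and is unused.
function isTokenUsable(doc: PasswordResetToken, rawToken: string, now: Date): boolean {
  return (
    doc.usedAt === undefined &&
    doc.tokenHash === hashToken(rawToken) &&
    new Date(doc.expiresAt).getTime() > now.getTime()
  );
}
```

On a successful reset the handler would set `usedAt` so the same link cannot be replayed.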

Cosmos container:

password_reset_tokens (pk: /productId) — short-lived; tokens are valid for 1 hour, and a 24h Cosmos TTL auto-deletes the documents

Dependency: Requires email delivery (§2.2) for sending reset links and verification emails. Can ship the endpoints first with req.log.info-logged URLs for dev/testing (never console.log).

2.6 Public Status Page

Why: Users and admins need a single place to check if services are operational. The health-check script exists but has no user-facing output.

Current state: monitoring/health-check.ts polls services and prints to stdout. No persistent status, no incident history, no public URL.

Proposed design:

Option A — Self-hosted (minimal):

platform-service/src/modules/status/
├── types.ts         — ServiceStatus, Incident, MaintenanceWindow schemas
├── repository.ts    — Cosmos: service_status, incidents containers
├── poller.ts        — Periodic health poll (reuses @bytelyst/monitoring)
└── routes.ts        — Public: GET /public/status, GET /public/status/history

Option B — External (Instatus, Statuspage, or Upptime):

Upptime (GitHub-based, free, open-source) — runs as a GitHub Action, publishes to GitHub Pages
Better for public credibility (hosted on a separate domain)

Recommendation: Option A for internal/admin use, Option B for public-facing.

Status page data model:

Field	Type	Description
`services`	array	Current status per service (operational/degraded/down)
`incidents`	array	Active and past incidents with timeline
`maintenanceWindows`	array	Scheduled maintenance with start/end times
`overallStatus`	enum	`operational` / `degraded` / `major_outage`
`lastCheckedAt`	ISO string	When the poller last ran
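
One way to derive `overallStatus` from the per-service states is shown below. The aggregation rule (any service down means `major_outage`, otherwise any degraded service means `degraded`) is an assumption; the roadmap does not define it, and a real rule might weight critical services differently.

```typescript
type ServiceState = "operational" | "degraded" | "down";
type OverallStatus = "operational" | "degraded" | "major_outage";

interface ServiceStatus {
  service: string;
  state: ServiceState;
}

// Worst-state-wins aggregation (assumed rule, see above).
function deriveOverallStatus(services: ServiceStatus[]): OverallStatus {
  if (services.some((s) => s.state === "down")) return "major_outage";
  if (services.some((s) => s.state === "degraded")) return "degraded";
  return "operational";
}
```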

Admin UI:

/ops/status page (or extend existing Mission Control /ops): service health cards with history sparklines
Incident management: create/update/resolve incidents with public-facing messages
Maintenance scheduling: create windows with auto-banners

P1 — Operational Maturity

These components improve reliability, debuggability, and operational efficiency. Not launch-blocking, but critical for a team running production services.

2.7 Session Management & Active Devices

Why: Licenses track deviceIds but there's no concept of active sessions. Users can't see where they're logged in. Admins can't force-revoke a compromised session. "Sign out all devices" is impossible.

Current state: JWT tokens with expiry. No session tracking. No revocation list. Refresh tokens are stateless.

Proposed design:

platform-service/src/modules/sessions/
├── types.ts         — SessionDoc, CreateSessionInput schemas
├── repository.ts    — Cosmos: sessions container (pk: /userId)
├── middleware.ts    — Session validation (check revocation on each request)
└── routes.ts        — User: list my sessions, revoke one, revoke all
                     — Admin: list user sessions, force-revoke

Session document:

interface SessionDoc {
  id: string; // session ID (embedded in JWT)
  productId: string;
  userId: string;
  deviceId?: string; // linked to license device
  platform: string; // ios, android, desktop, web
  ipAddress: string;
  userAgent: string;
  lastActiveAt: string;
  createdAt: string;
  revokedAt?: string;
  expiresAt: string;
}

Endpoints:

GET /sessions — list my active sessions
DELETE /sessions/:id — revoke specific session
DELETE /sessions — revoke all sessions (sign out everywhere)
GET /sessions/user/:userId — admin: list user's sessions
DELETE /sessions/user/:userId — admin: force-revoke all

Integration: Refresh token endpoint creates a session. Auth middleware checks session isn't revoked (Cosmos point-read by session ID, cached in-memory with short TTL).
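
The cached revocation check might look like the sketch below. The loader is synchronous here for brevity (the real point-read would be async), and the class and parameter names are hypothetical. The trade-off it makes visible: a revocation takes effect only after the cached entry expires, so the cache TTL bounds how long a revoked session can keep working.

```typescript
// Subset of SessionDoc needed for validation.
interface SessionRecord {
  id: string;
  expiresAt: string;
  revokedAt?: string;
}

class SessionValidator {
  private cache = new Map<string, { active: boolean; cachedAt: number }>();

  constructor(
    private load: (sessionId: string) => SessionRecord | undefined, // stands in for the Cosmos point-read
    private cacheTtlMs: number,
  ) {}

  isActive(sessionId: string, now: number): boolean {
    const hit = this.cache.get(sessionId);
    if (hit && now - hit.cachedAt < this.cacheTtlMs) return hit.active;

    const doc = this.load(sessionId);
    const active =
      doc !== undefined &&
      doc.revokedAt === undefined &&
      new Date(doc.expiresAt).getTime() > now;
    this.cache.set(sessionId, { active, cachedAt: now });
    return active;
  }
}
```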

2.8 Database Migration & Schema Evolution Tracker

Why: Cosmos DB is schemaless, but breaking changes still happen: new required fields, partition key changes, index policy updates, container renames. Without tracking, deployments are error-prone and rollbacks are impossible.

Current state: No migration tracking. Schema changes are applied ad-hoc.

Proposed design:

platform-service/src/migrations/
├── runner.ts        — Run pending migrations on startup (idempotent)
├── registry.ts      — List of migration files, ordered by version
└── migrations/
    ├── 001_add_productId_to_legacy_users.ts
    ├── 002_create_telemetry_containers.ts
    └── ...

Migration document (in migrations container):

interface MigrationDoc {
  id: string; // "001_add_productId_to_legacy_users"
  productId: string; // "platform"
  version: number;
  description: string;
  appliedAt: string;
  durationMs: number;
  status: 'applied' | 'failed' | 'rolled_back';
  error?: string;
}

Behavior:

On service startup, runner checks migrations container for applied versions
Runs any unapplied migrations in order
Each migration is idempotent (safe to re-run)
Failed migrations are recorded but don't block startup (logged as warnings)
Admin UI: /ops/migrations page showing applied/pending/failed
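
The startup behavior could be sketched as below. Migrations are synchronous here for brevity, and stopping at the first failure (so later migrations never run against a half-migrated state) is an assumption; the roadmap only says failures are recorded and do not block startup.

```typescript
interface Migration {
  id: string;
  version: number;
  up: () => void; // must be idempotent
}

interface MigrationRecord {
  id: string;
  version: number;
  status: "applied" | "failed" | "rolled_back";
  error?: string;
}

// Migrations with no "applied" record, in ascending version order.
function pendingMigrations(registry: Migration[], records: MigrationRecord[]): Migration[] {
  const applied = new Set<number>();
  records.forEach((r) => {
    if (r.status === "applied") applied.add(r.version);
  });
  return registry
    .filter((m) => !applied.has(m.version))
    .sort((a, b) => a.version - b.version);
}

// Runs pending migrations and returns the new records. Never throws, so startup continues.
function runPending(registry: Migration[], records: MigrationRecord[]): MigrationRecord[] {
  const results: MigrationRecord[] = [];
  for (const m of pendingMigrations(registry, records)) {
    try {
      m.up();
      results.push({ id: m.id, version: m.version, status: "applied" });
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      results.push({ id: m.id, version: m.version, status: "failed", error });
      break; // assumption: preserve ordering by not running later migrations
    }
  }
  return results;
}
```

Because a failed migration has no "applied" record, it is retried on the next startup, which is why each `up()` needs to be idempotent.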

2.9 Data Export & Bulk Operations

Why: Admins regularly need: export users as CSV, export audit logs, bulk status updates, bulk license revocation. Today these require direct database queries.

Current state: Waitlist has a CSV export endpoint. Nothing else supports bulk operations.

Proposed design:

platform-service/src/modules/exports/
├── types.ts         — ExportJob, ExportFormat schemas
├── repository.ts    — Cosmos: export_jobs container
├── workers/
│   ├── users.ts     — Export users as CSV/JSON
│   ├── audit.ts     — Export audit log
│   ├── telemetry.ts — Export telemetry events
│   ├── usage.ts     — Export usage data
│   └── subscriptions.ts — Export subscriptions
└── routes.ts        — POST /exports (start), GET /exports (list), GET /exports/:id/download

Flow:

Admin POST /api/exports → { type: 'users', format: 'csv', filters: { plan: 'free' } }
Background job runs query, writes result to blob storage (via existing blob module)
Job status updates: pending → processing → ready / failed
Admin downloads from signed blob URL (SAS token via @bytelyst/blob)
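
For the CSV format, the export workers need to escape field values correctly. Below is a small, dependency-free sketch following the usual RFC 4180 quoting rules; the function names are hypothetical, and a streaming writer would be preferable for large exports.

```typescript
// Quote a field when it contains a comma, a double quote, or a line break;
// embedded double quotes are doubled.
function csvEscape(value: unknown): string {
  const text = value === null || value === undefined ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize rows with an explicit column order so the header and cells always line up.
function toCsv(rows: Record<string, unknown>[], columns: string[]): string {
  const header = columns.map(csvEscape).join(",");
  const lines = rows.map((row) => columns.map((c) => csvEscape(row[c])).join(","));
  return [header].concat(lines).join("\r\n");
}
```

Note that this handles CSV syntax only. If exports may be opened in spreadsheet software, values beginning with characters such as `=` would need separate handling.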

Dependencies: blob module (existing) for storage, jobs module (§2.1) for auto-cleanup of expired exports.

Supported exports:

Users (with filters: plan, status, date range)
Audit log (with filters: action, userId, date range)
Telemetry events (with filters: platform, eventType, date range)
Usage records (with filters: userId, date range)
Subscriptions (with filters: plan, status)
Licenses (with filters: status, plan)

Admin UI:

/ops/exports page: create new export, list past exports, download links
Progress indicator for running exports
Auto-cleanup: delete export blobs after 7 days

2.10 Maintenance Mode & Graceful Degradation

Why: Kill switch is binary (on/off per product). Need nuanced control: read-only mode, specific features disabled, custom banner messages, admin bypass, scheduled windows.

Current state: settings/kill-switch endpoint returns boolean per product. Clients check and fully disable themselves.

Proposed design:

Extend the existing settings module:

interface MaintenanceConfig {
  mode: 'off' | 'read_only' | 'maintenance' | 'emergency';
  message: string; // Shown to users
  adminMessage?: string; // Shown to admins
  bypassRoles: string[]; // Roles that can bypass (e.g., ['admin', 'super_admin'])
  bypassIPs: string[]; // IP addresses that bypass
  scheduledStart?: string; // ISO — for planned maintenance
  scheduledEnd?: string;
  affectedServices: string[]; // ['api', 'dictation', 'extraction'] or ['*']
  updatedAt: string;
  updatedBy: string;
}

Modes:

off — Normal operation
read_only — GET requests allowed, writes blocked (for database maintenance)
maintenance — All requests return 503 with message (except admin bypass)
emergency — Kill switch + maintenance message + all clients show error
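
Server-side, the four modes could be enforced by a single decision function like the sketch below. Two points are assumptions rather than part of the design above: HEAD and OPTIONS are treated as reads alongside GET, and bypass rules are honored in `read_only` and `maintenance` but not in `emergency`. IP bypass is an exact string match here.

```typescript
type MaintenanceMode = "off" | "read_only" | "maintenance" | "emergency";

// Subset of MaintenanceConfig needed for the decision.
interface MaintenanceRules {
  mode: MaintenanceMode;
  message: string;
  bypassRoles: string[];
  bypassIPs: string[];
}

interface IncomingRequest {
  method: string;
  ip: string;
  role?: string;
}

type Decision = { allowed: true } | { allowed: false; status: 503; message: string };

const READ_METHODS = ["GET", "HEAD", "OPTIONS"]; // assumption: all treated as reads

function evaluateRequest(rules: MaintenanceRules, req: IncomingRequest): Decision {
  if (rules.mode === "off") return { allowed: true };

  const bypass =
    (req.role !== undefined && rules.bypassRoles.indexOf(req.role) !== -1) ||
    rules.bypassIPs.indexOf(req.ip) !== -1;
  // Assumption: emergency mode blocks everyone, including bypass roles and IPs.
  if (bypass && rules.mode !== "emergency") return { allowed: true };

  const isRead = READ_METHODS.indexOf(req.method.toUpperCase()) !== -1;
  if (rules.mode === "read_only" && isRead) return { allowed: true };

  return { allowed: false, status: 503, message: rules.message };
}
```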

Endpoints:

GET /settings/maintenance — Public: check current mode + message
PUT /settings/maintenance — Admin: update mode, message, bypass rules
GET /settings/maintenance/schedule — Upcoming maintenance windows

Client integration:

Clients poll /settings/maintenance alongside kill-switch check
If mode !== 'off', show banner with message
If mode === 'read_only', disable write operations with a user-facing explanation

Admin UI:

Extend existing Settings page or add /ops/maintenance
Mode toggle (off/read-only/maintenance/emergency)
Message editor with preview
Schedule builder with start/end date pickers
Bypass IP whitelist management

Storage: Maintenance config is a single document per product in the existing settings container (field: maintenanceConfig). No new Cosmos container needed.

2.11 Rate Limit Dashboard & IP Allow/Deny Lists

Why: The ratelimit module exists, but admins have no visibility into who is being rate-limited and no way to allow-list VIP users or deny-list abusive IPs.

Current state: In-memory sliding window rate limiter with configurable rules. No persistence, no admin visibility.

Proposed design:

Extend ratelimit module:

interface RateLimitEntry {
  key: string; // userId or IP
  productId: string;
  currentCount: number;
  windowStart: string;
  wasLimited: boolean;
  lastLimitedAt?: string;
}

interface IPRule {
  id: string;
  productId: string;
  ip: string; // CIDR notation supported
  action: 'allow' | 'deny';
  reason: string;
  createdBy: string;
  createdAt: string;
  expiresAt?: string; // Temporary blocks
}
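
Since `IPRule.ip` supports CIDR notation, rule evaluation needs a range match. The sketch below covers IPv4 only (IPv6 would need separate handling) and uses plain arithmetic rather than bit shifts to avoid JavaScript's signed 32-bit pitfalls; the function names are hypothetical.

```typescript
// Parse dotted-quad IPv4 into an unsigned integer, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when `ip` falls inside `rule`, which is either a single address or CIDR (a.b.c.d/n).
function ipMatchesRule(ip: string, rule: string): boolean {
  const slash = rule.indexOf("/");
  const base = slash === -1 ? rule : rule.slice(0, slash);
  const prefixText = slash === -1 ? "32" : rule.slice(slash + 1);
  if (!/^\d{1,2}$/.test(prefixText)) return false;
  const prefix = Number(prefixText);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  if (ipInt === null || baseInt === null || prefix > 32) return false;
  // Two addresses share a /n network when they fall in the same block of 2^(32-n) addresses.
  const blockSize = Math.pow(2, 32 - prefix);
  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
}
```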

Additional endpoints:

GET /ratelimit/stats — Admin: top rate-limited keys, total 429s in last hour/day
GET /ratelimit/blocked — Admin: currently blocked keys
POST /ratelimit/ip-rules — Admin: add IP allow/deny rule
GET /ratelimit/ip-rules — Admin: list rules
DELETE /ratelimit/ip-rules/:id — Admin: remove rule

Admin UI:

/ops/rate-limits page: real-time rate limit stats
Top offenders table (most 429 responses)
IP rules management (allow/deny with expiry)
Per-user rate limit override

Cosmos container:

ip_rules (pk: /productId) — persistent IP allow/deny rules
Rate limit stats remain in-memory (ephemeral); no persistence needed for counters

P2 — Product Intelligence

These components provide deeper insight into product health, user behavior, and experiment outcomes. They transform raw data into actionable intelligence.

2.12 A/B Testing & Experiments Framework

Why: Feature flags exist but only support on/off with percentage rollout. No variant assignment, metric collection, or statistical significance calculation.

Current state: flags module with boolean flags and FNV-1a deterministic rollout.

Proposed design:

Extend flags module or create sibling experiments module:

platform-service/src/modules/experiments/
├── types.ts         — Experiment, Variant, ExperimentMetric schemas
├── repository.ts    — Cosmos: experiments container
├── assignment.ts    — Deterministic variant assignment (extend FNV-1a)
├── analysis.ts      — Statistical significance calculation
└── routes.ts        — Admin CRUD + results endpoint

Experiment document:

interface ExperimentDoc {
  id: string;
  productId: string;
  name: string;
  hypothesis: string;
  status: 'draft' | 'running' | 'paused' | 'concluded';
  variants: Variant[]; // [{id: 'control', weight: 50}, {id: 'treatment', weight: 50}]
  targetingRules: FlagTargetingRules; // Reuse from flags module (platforms, versions, percentage)
  primaryMetric: string; // e.g., 'dictation_completed_rate'
  secondaryMetrics: string[];
  startedAt?: string;
  concludedAt?: string;
  winningVariant?: string;
  sampleSize: number;
  results?: ExperimentResults;
}
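
Deterministic variant assignment could extend the FNV-1a approach along these lines. The hash below is the standard 32-bit FNV-1a applied to UTF-16 code units (identical to the byte-wise version for ASCII input). Hashing the string `experimentId:userId` is an assumption; the flags module's actual hash input is not shown in this document.

```typescript
interface Variant {
  id: string;
  weight: number; // relative weight, e.g. 50 / 50
}

// 32-bit FNV-1a over the string's code units.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned
}

// The same (experiment, user) pair always lands in the same variant.
// Including the experiment ID decorrelates assignments across experiments.
function assignVariant(experimentId: string, userId: string, variants: Variant[]): string {
  const total = variants.reduce((sum, v) => sum + v.weight, 0);
  if (total <= 0) throw new Error("variants need a positive total weight");
  let bucket = fnv1a32(`${experimentId}:${userId}`) % total;
  for (const v of variants) {
    if (bucket < v.weight) return v.id;
    bucket -= v.weight;
  }
  throw new Error("unreachable for non-negative weights");
}
```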

Admin UI:

/experiments page: list experiments, create new, view results
Results view: conversion rates per variant, confidence interval, statistical significance indicator
"Conclude" action: pick winner, auto-convert to feature flag

2.13 Analytics Aggregation Pipeline

Why: usage tracks raw events but there are no pre-aggregated rollups. Admin dashboard charts require expensive real-time queries. DAU/WAU/MAU, retention cohorts, and funnel analysis are impossible without rollups.

Current state: Raw usage_daily records. No aggregation.

Proposed design:

platform-service/src/modules/analytics/
├── types.ts         — MetricRollup, CohortEntry, FunnelStep schemas
├── repository.ts    — Cosmos: analytics_rollups container
├── rollup-jobs/
│   ├── dau-wau-mau.ts    — Daily/weekly/monthly active users
│   ├── retention.ts      — Cohort retention (D1, D7, D14, D30)
│   ├── funnel.ts         — Conversion funnels (signup → activate → dictate → subscribe)
│   └── feature-adoption.ts — Per-feature usage rates
└── routes.ts        — Admin: GET /analytics/dau, /retention, /funnel, /adoption

Rollup schedule (via jobs module):

DAU: every hour (incremental)
WAU/MAU: daily at 1am UTC
Retention cohorts: daily at 2am UTC
Funnels: daily at 2:30am UTC

Key metrics:

DAU/WAU/MAU — with breakdown by platform, plan
Retention cohorts — "Of users who signed up in week X, what % are active in week X+1, X+4?"
Conversion funnel — signup → first dictation → 5th dictation → subscription
Feature adoption — % of active users using each major feature
Revenue metrics — MRR, churn rate, ARPU, LTV (from subscriptions + Stripe data)
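
As a rough illustration of the DAU/WAU/MAU rollup, the sketch below counts distinct users in a trailing window over per-day activity records. The `ActivityRecord` shape is hypothetical, and defining WAU and MAU as rolling 7-day and 30-day windows (rather than calendar weeks and months) is an assumption.

```typescript
// One record per user per active day; `day` is a UTC date in YYYY-MM-DD form.
interface ActivityRecord {
  userId: string;
  day: string;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Distinct users active in the `windowDays`-day window ending on `endDay` (inclusive).
function activeUsers(records: ActivityRecord[], endDay: string, windowDays: number): number {
  const end = Date.parse(`${endDay}T00:00:00Z`);
  const start = end - (windowDays - 1) * DAY_MS;
  const users = new Set<string>();
  records.forEach((r) => {
    const t = Date.parse(`${r.day}T00:00:00Z`);
    if (t >= start && t <= end) users.add(r.userId);
  });
  return users.size;
}

// DAU, WAU, MAU as windows of 1, 7, and 30 days.
function rollup(records: ActivityRecord[], endDay: string) {
  return {
    dau: activeUsers(records, endDay, 1),
    wau: activeUsers(records, endDay, 7),
    mau: activeUsers(records, endDay, 30),
  };
}
```

A production rollup would run the equivalent aggregation in the database and store the result in `analytics_rollups`, rather than loading records into memory.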

Admin UI:

Extend dashboard home or create /analytics page
Charts: DAU/WAU/MAU line chart, retention heatmap, funnel bar chart, MRR trend

2.14 In-App Feedback

Why: Tracker handles issue tracking but there's no way for end users to submit feedback directly from the app. Bug reports with device context, NPS surveys, and feature requests should flow into the tracker automatically.

Current state: Public roadmap allows feature submissions and voting. No in-app feedback widget.

Proposed design:

platform-service/src/modules/feedback/
├── types.ts         — FeedbackEntry, FeedbackType, DeviceContext schemas
├── repository.ts    — Cosmos: feedback container (pk: /productId)
└── routes.ts        — POST /feedback (authenticated), GET /feedback (admin query)

Feedback types:

bug_report — with device context, screenshot URL (blob), reproduction steps
feature_request — auto-creates tracker item in items module
nps_survey — score (0-10), comment, context
general — free-form text

Client integration:

Shake-to-report (iOS/Android) or keyboard shortcut (Desktop)
Auto-attach: device model, OS version, app version, current screen, last 10 telemetry events
Screenshot capture (optional, privacy-respecting)

Admin UI:

/feedback page: list feedback with filters (type, platform, date range, NPS score range)
Quick actions: convert to tracker item, reply, dismiss
NPS dashboard: score distribution over time, detractor/promoter breakdown
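
For the NPS dashboard, the score follows the standard definition: the percentage of promoters (scores 9 to 10) minus the percentage of detractors (scores 0 to 6). A minimal sketch, with hypothetical function names:

```typescript
interface NpsBreakdown {
  promoters: number;
  passives: number;
  detractors: number;
  score: number | null; // null when there are no responses
}

function npsBreakdown(scores: number[]): NpsBreakdown {
  const promoters = scores.filter((s) => s >= 9).length;
  const detractors = scores.filter((s) => s <= 6).length;
  const passives = scores.length - promoters - detractors;
  const score =
    scores.length === 0
      ? null
      : Math.round(((promoters - detractors) / scores.length) * 100);
  return { promoters, passives, detractors, score };
}
```

The result ranges from -100 (all detractors) to 100 (all promoters).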

2.15 User Impersonation / Admin Shadow Mode

Why: When a user reports a bug, admins need to see exactly what they see. Without impersonation, debugging requires asking users for screenshots and steps, which is slow and unreliable.

Current state: No impersonation capability.

Proposed design:

Endpoint:

POST /auth/impersonate — Admin only. Accepts { targetUserId }. Returns a scoped shadow token.

Shadow token properties:

Contains impersonatedBy: adminUserId claim
Read-only by default (no writes unless explicitly allowed)
Expires in 15 minutes (non-renewable)
All actions logged to audit with impersonatedBy field
Visible banner in dashboard: "You are viewing as [user name] — all actions are audited"
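To make the shadow token properties concrete, here is a small sketch of the claims and the read-only guard. Only `impersonatedBy` comes from the design above; the other claim names (`sub`, `readOnly`, `iat`, `exp`) and both function names are assumptions, and token signing is left out entirely.

```typescript
// Hypothetical claim layout; only `impersonatedBy` is named in the design above.
interface ShadowClaims {
  sub: string;            // the user being viewed
  impersonatedBy: string; // the admin who started the session
  readOnly: boolean;
  iat: number;            // issued-at, seconds since epoch
  exp: number;            // expiry, seconds since epoch
}

const SHADOW_TTL_SECONDS = 15 * 60; // 15 minutes, non-renewable

function buildShadowClaims(adminUserId: string, targetUserId: string, nowMs: number): ShadowClaims {
  const iat = Math.floor(nowMs / 1000);
  return {
    sub: targetUserId,
    impersonatedBy: adminUserId,
    readOnly: true, // read-only by default
    iat,
    exp: iat + SHADOW_TTL_SECONDS,
  };
}

const SAFE_METHODS = new Set(['GET', 'HEAD', 'OPTIONS']);

// Decide whether a request carrying a shadow token may proceed.
function isShadowRequestAllowed(claims: ShadowClaims, method: string, nowMs: number): boolean {
  if (Math.floor(nowMs / 1000) >= claims.exp) return false; // expired
  if (claims.readOnly && !SAFE_METHODS.has(method.toUpperCase())) return false; // writes blocked
  return true;
}
```

A guard like this would sit in an auth hook, next to the code that writes the `impersonatedBy` field to the audit log.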

Admin UI:

On the user detail page (/users/:id), add "View as User" button
Opens user dashboard in new tab with shadow token
Impersonation sessions listed on /ops/audit with filter

2.16 Changelog & In-App Release Notes

Why: Users should know what changed in each release. A changelog system also serves as internal documentation and can be shown as a "What's New" modal in the app.

Current state: CHANGELOG.md exists in the repo but nothing in-app.

Proposed design:

platform-service/src/modules/changelog/
├── types.ts         — ChangelogEntry, ReleaseNote schemas
├── repository.ts    — Cosmos: changelog container (pk: /productId)
└── routes.ts        — Public: GET /changelog (paginated)
                     — Admin: CRUD changelog entries

Entry document:

interface ChangelogEntry {
  id: string;
  productId: string;
  version: string; // "1.2.0"
  title: string;
  body: string; // Markdown
  category: 'feature' | 'improvement' | 'bugfix' | 'security';
  platforms: string[]; // ['ios', 'android', 'desktop', 'web']
  publishedAt?: string;
  isDraft: boolean;
  createdBy: string;
}

Client integration:

App checks GET /api/changelog?since=<lastSeenVersion> on launch
If new entries exist, show "What's New" modal
User can dismiss; lastSeenVersion stored in settings
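The "since last seen version" check depends on comparing versions numerically rather than as strings (otherwise "1.10.0" sorts before "1.9.0"). A minimal sketch, assuming plain `MAJOR.MINOR.PATCH` strings without pre-release suffixes; `compareVersions`, `unseenEntries`, and the `EntrySummary` shape are hypothetical names, not part of the proposed module:

```typescript
// Compare two dotted numeric versions such as "1.2.0"; returns <0, 0, or >0.
// Assumes numeric segments only (no "-beta" suffixes).
function compareVersions(a: string, b: string): number {
  const pa = a.split('.').map(Number);
  const pb = b.split('.').map(Number);
  const len = Math.max(pa.length, pb.length);
  for (let i = 0; i < len; i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Subset of ChangelogEntry needed for the "What's New" decision.
interface EntrySummary {
  version: string;
  title: string;
  isDraft: boolean;
  platforms: string[];
}

// Published entries for this platform that are newer than lastSeenVersion, newest first.
function unseenEntries(entries: EntrySummary[], lastSeenVersion: string, platform: string): EntrySummary[] {
  return entries
    .filter((e) => !e.isDraft && e.platforms.includes(platform))
    .filter((e) => compareVersions(e.version, lastSeenVersion) > 0)
    .sort((x, y) => compareVersions(y.version, x.version));
}
```

The same comparison could run server-side behind `?since=`, so clients only need to show the modal when the response is non-empty.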

Admin UI:

/changelog page: create/edit/publish entries with Markdown editor
Preview mode before publishing
Schedule publishing for future date

P3 — Scale & Polish

These components are important for scale, security, and developer experience, but are lower urgency.

2.17 CDN & Asset Pipeline

Why: Blob storage serves files directly from Azure. No edge caching, no image optimization, no automatic resizing for avatars/thumbnails.

Proposed approach:

Azure CDN or Cloudflare in front of blob storage
Image resize on upload (Sharp) for avatars: 64px, 128px, 256px variants
Cache headers: Cache-Control: public, max-age=31536000, immutable for content-addressed assets
Release binaries served via CDN for faster desktop app updates

2.18 API Versioning Strategy

Why: As external consumers appear (webhook integrations, third-party tools), breaking API changes need to be managed. Today all endpoints are unversioned.

Proposed approach:

URL prefix: /v1/api/...
Deprecation header: Sunset: <date> + Deprecation: true
Version lifecycle: current → deprecated (6 months notice) → retired
OpenAPI spec generated per version
Fastify plugin that routes to versioned handlers
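As a rough illustration of the lifecycle and headers listed above, the sketch below derives response headers from a version's state and extracts the version from a `/v1/api/...` path. The function names and the `Lifecycle` type are assumptions; the `Sunset` value uses the HTTP-date format produced by `Date.prototype.toUTCString()`.

```typescript
type Lifecycle = 'current' | 'deprecated' | 'retired';

// Headers to attach to responses for a given API version state.
function versionHeaders(state: Lifecycle, sunset?: Date): Record<string, string> {
  if (state !== 'deprecated') return {};
  const headers: Record<string, string> = { Deprecation: 'true' };
  // toUTCString() yields an HTTP-date, e.g. "Sat, 01 Aug 2026 00:00:00 GMT".
  if (sunset) headers['Sunset'] = sunset.toUTCString();
  return headers;
}

// Extract the numeric version from a path such as "/v1/api/items"; null if unversioned.
function parseVersion(path: string): number | null {
  const match = /^\/v(\d+)\//.exec(path);
  return match ? Number(match[1]) : null;
}
```

In a Fastify plugin, `parseVersion` could drive handler selection while `versionHeaders` is applied in an `onSend` hook; both placements are suggestions rather than a settled design.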

2.19 OpenAPI / Auto-Generated API Docs

Why: Platform-service already passes swagger config to createServiceApp(), but Zod schemas aren't fully wired to route definitions. The admin /docs page is a markdown doc browser (not API docs). Auto-generated API docs from Zod schemas would be nearly free.

Current state: @fastify/swagger is configured with title/description but route schemas aren't connected via a Zod type provider (the community package fastify-type-provider-zod; there is no official @fastify/type-provider-zod). Swagger UI may already be partially served but without route-level detail.

Proposed approach:

Wire fastify-type-provider-zod to connect existing Zod schemas to Fastify route definitions
Verify @fastify/swagger-ui is serving at /documentation on platform-service
Add route-level schema: { body, querystring, params, response } using existing Zod schemas
Export OpenAPI JSON at /documentation/json
Admin dashboard links to platform-service Swagger UI

2.20 Localization / i18n Service

Why: Centralized string management for all platforms. When adding a new language, change one place, not four codebases.

Proposed approach:

translations Cosmos container (pk: /productId:locale)
Admin UI: string management with translation status per locale
Client SDK: fetch translations on launch, cache locally
Fallback chain: requested locale → base locale → English
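The fallback chain can be expressed as a short lookup order. This is a sketch under assumptions: locales are BCP 47-style tags such as `pt-BR`, the catalog is an in-memory map (in practice it would be the cached `translations` documents), and `fallbackChain` / `translate` are hypothetical names.

```typescript
// Build the lookup order for a requested locale: requested, then base language, then default.
function fallbackChain(requested: string, defaultLocale = 'en'): string[] {
  const chain = [requested];
  const base = requested.split('-')[0] ?? requested;
  if (base !== requested) chain.push(base);
  if (!chain.includes(defaultLocale)) chain.push(defaultLocale);
  return chain;
}

// locale -> key -> translated string
type Catalog = Record<string, Record<string, string>>;

// Return the first translation found along the chain; fall back to the key itself.
function translate(catalog: Catalog, locale: string, key: string): string {
  for (const candidate of fallbackChain(locale)) {
    const value = catalog[candidate]?.[key];
    if (value !== undefined) return value;
  }
  return key;
}
```

Returning the key as a last resort keeps missing strings visible in the UI, which pairs well with the per-locale translation status in the admin page.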

2.21 Full-Text Search

Why: Admin needs to search users by partial name/email. Users need to search memories/items. Cosmos SQL CONTAINS() is slow and doesn't rank results.

Proposed approach:

Phase 1: Cosmos DB full-text search (preview feature, no extra cost)
Phase 2: Azure AI Search for richer capabilities (fuzzy matching, facets, suggestions)
Admin UI: unified search bar across entities (users, items, audit logs)

2.22 Multi-Tenant Workspace / Org / Team Management

Why: productId scopes data per product, but within a product there's no team or organization concept. Enterprise customers need: org hierarchy, team-scoped permissions, shared brains/workspaces.

Proposed design (future):

users → belong to → organizations → have → teams → own → resources

This is a major architectural expansion. Defer until enterprise tier is validated.

2.23 Data Retention & Lifecycle Policies

Why: Telemetry has TTL. Other containers don't. Old audit logs, expired sessions, redeemed promos, and stale waitlist entries accumulate forever.

Proposed approach:

Admin-configurable retention policies per container
Scheduled job (from §2.1) runs cleanup
Default policies: audit (365 days), telemetry (30 days), sessions (90 days), export files (7 days)
Admin UI: /ops/retention page showing policies and next cleanup run
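A cleanup job mostly needs one thing per container: the timestamp before which documents may be deleted. The sketch below computes that cutoff from the default policies listed above, with admin overrides taking precedence. The container keys, the `cutoffFor` name, and the idea of filtering on a `createdAt` field are assumptions.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Default retention in days, taken from the default policies above.
const DEFAULT_RETENTION_DAYS: Record<string, number> = {
  audit: 365,
  telemetry: 30,
  sessions: 90,
  export_files: 7,
};

// ISO timestamp before which documents in `container` are eligible for cleanup.
// Returns null when no policy applies, meaning documents are kept indefinitely.
function cutoffFor(container: string, now: Date, overrides: Record<string, number> = {}): string | null {
  const days = overrides[container] ?? DEFAULT_RETENTION_DAYS[container];
  if (days === undefined) return null;
  return new Date(now.getTime() - days * DAY_MS).toISOString();
}
```

The scheduled job could pass the result as a query parameter (for example, deleting documents whose `createdAt` is older than the cutoff), assuming documents carry an ISO `createdAt` field.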

2.24 Automated Backup & Point-in-Time Restore

Why: Azure Cosmos DB supports continuous backup (an opt-in backup mode that must be enabled on the account), but admins need visibility and one-click restore capability.

Proposed approach:

Admin UI: /ops/backups page showing Azure backup status
Manual export to blob (scheduled job from §2.1)
Restore button: triggers Azure Cosmos point-in-time restore API
Cross-region replication status indicator

2.25 Billing Dunning & Payment Recovery

Why: Stripe handles retries, but the platform needs to: notify users of failed payments, offer grace periods, and eventually downgrade plans.

Proposed flow:

invoice.payment_failed → send "payment failed" email (§2.2) + in-app banner
After 3 failures (Stripe Smart Retries) → send "final warning" email
After grace period (7 days) → downgrade to free plan + email notification
All transitions logged to audit

Integration: Stripe webhook handler (existing) + email delivery (§2.2) + scheduled job (§2.1) for grace period enforcement.
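The dunning flow is a small state machine driven by two inputs: failed-payment webhooks and the scheduled grace-period check. The sketch below is one possible reading of the flow. It assumes the 7-day grace period starts when the final warning is sent (the flow above does not say), and every type and function name here is hypothetical.

```typescript
type DunningState = 'active' | 'payment_failed' | 'final_warning' | 'downgraded';
type DunningAction =
  | 'send_payment_failed_email'
  | 'send_final_warning_email'
  | 'downgrade_to_free'
  | 'none';

interface DunningRecord {
  state: DunningState;
  failureCount: number;
  finalWarningAt?: number; // ms since epoch; assumption: grace period starts here
}

interface Transition {
  next: DunningRecord;
  action: DunningAction;
}

const MAX_FAILURES = 3;
const GRACE_PERIOD_MS = 7 * 24 * 60 * 60 * 1000;

// Called from the Stripe webhook handler on invoice.payment_failed.
function onPaymentFailed(rec: DunningRecord, nowMs: number): Transition {
  if (rec.state === 'downgraded') return { next: rec, action: 'none' };
  const failureCount = rec.failureCount + 1;
  if (failureCount >= MAX_FAILURES) {
    // Only warn once, even if Stripe reports further failures.
    if (rec.state === 'final_warning') return { next: { ...rec, failureCount }, action: 'none' };
    return {
      next: { state: 'final_warning', failureCount, finalWarningAt: nowMs },
      action: 'send_final_warning_email',
    };
  }
  return { next: { state: 'payment_failed', failureCount }, action: 'send_payment_failed_email' };
}

// Called by the scheduled job to enforce the grace period.
function onGraceCheck(rec: DunningRecord, nowMs: number): Transition {
  const expired =
    rec.state === 'final_warning' &&
    rec.finalWarningAt !== undefined &&
    nowMs - rec.finalWarningAt >= GRACE_PERIOD_MS;
  if (expired) return { next: { ...rec, state: 'downgraded' }, action: 'downgrade_to_free' };
  return { next: rec, action: 'none' };
}
```

Keeping both functions pure (state in, state and action out) makes the transitions easy to unit test and to log to audit, and it keeps repeated webhook deliveries from sending duplicate warnings.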

3. Implementation Priority Matrix

Phase	Components	Effort	Dependencies	Unlocks
Sprint 1	2.1 Scheduled Jobs	M	None	Foundation for all time-based operations
Sprint 1	2.4 Event Bus	S	None	Decoupling for email, webhooks, audit
Sprint 2	2.2 Email Delivery	M	2.4 Event Bus	User communication (welcome, trial expiry, payment failed)
Sprint 2	2.5 Password Reset + Email Verify	S	2.2 Email Delivery	Auth completeness — table-stakes for production
Sprint 3	2.3 Webhook Subscriptions	M	2.4 Event Bus	Third-party integrations, Zapier/Slack
Sprint 3	2.7 Session Management	S	None	Security (sign out everywhere, revocation)
Sprint 4	2.10 Maintenance Mode	S	None	Operational control during deployments
Sprint 4	2.9 Data Export	S	2.1 Jobs (for blob cleanup)	Admin self-service, compliance
Sprint 5	2.13 Analytics Rollups	M	2.1 Jobs (for rollup scheduling)	Dashboard charts, business metrics
Sprint 5	2.19 OpenAPI Docs	S	None	Developer experience, API discoverability
Sprint 6	2.6 Status Page	S	None	User trust, incident communication
Sprint 6	2.16 Changelog	S	None	User engagement, release communication
Sprint 7	2.11 Rate Limit Dashboard	S	None	Ops visibility
Sprint 7	2.25 Billing Dunning	S	2.1 Jobs + 2.2 Email	Payment recovery automation
Later	2.8, 2.12, 2.14–2.15, 2.17–2.18, 2.20–2.24	Varies	—	Scale, polish, enterprise

Effort key: S = Small (1–2 days), M = Medium (3–5 days), L = Large (1–2 weeks)

Critical path: Event Bus (2.4) → Email Delivery (2.2) → Password Reset (2.5). These three should be the first items built, in that order.
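The Dependencies column can be checked mechanically. The sketch below transcribes the dependencies from the matrix and runs a depth-first topological sort, which yields one valid build order and fails loudly if a cycle is ever introduced. `DEPENDS_ON` and `buildOrder` are illustrative names, and components listed with no dependencies other than 2.1 and 2.4 are omitted for brevity.

```typescript
// Dependencies transcribed from the matrix above (component -> prerequisites).
const DEPENDS_ON: Record<string, string[]> = {
  '2.1': [],
  '2.4': [],
  '2.2': ['2.4'],
  '2.5': ['2.2'],
  '2.3': ['2.4'],
  '2.9': ['2.1'],
  '2.13': ['2.1'],
  '2.25': ['2.1', '2.2'],
};

// Depth-first topological sort; throws if the graph contains a cycle.
function buildOrder(deps: Record<string, string[]>): string[] {
  const order: string[] = [];
  const state = new Map<string, 'visiting' | 'done'>();
  const visit = (node: string): void => {
    if (state.get(node) === 'done') return;
    if (state.get(node) === 'visiting') throw new Error(`dependency cycle at ${node}`);
    state.set(node, 'visiting');
    for (const dep of deps[node] ?? []) visit(dep);
    state.set(node, 'done');
    order.push(node); // all prerequisites are already in `order`
  };
  for (const node of Object.keys(deps)) visit(node);
  return order;
}
```

Any order the function returns places 2.4 before 2.2 and 2.2 before 2.5, matching the critical path stated above.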

4. New Cosmos Containers & Cost Impact

Each new component introduces Cosmos containers. Cosmos DB Serverless charges per RU consumed + storage, so idle containers cost only storage (~$0.25/GB/month).

Component	New Containers	Partition Key	Est. TTL	Est. Daily RU
2.1 Jobs	`job_definitions`, `job_runs`	`/productId`, `/productId:jobName`	runs: 90d	~50 RU (low volume)
2.2 Email/Push	`delivery_log`, `email_templates`	`/productId:channel:yyyyMM`, `/productId`	log: 90d	~200 RU
2.3 Webhooks	`webhook_subscriptions`, `webhook_deliveries`	`/productId`, `/subscriptionId:yyyyMM`	deliveries: 30d	~100 RU
2.5 Password Reset	`password_reset_tokens`, `email_verifications`	`/productId`, `/productId`	24h auto	~10 RU
2.6 Status	`service_status`, `incidents`	`/productId`, `/productId`	None	~20 RU
2.7 Sessions	`sessions`	`/userId`	90d	~500 RU (read-heavy)
2.8 Migrations	`migrations`	`/productId`	None	~5 RU (startup only)
2.9 Exports	`export_jobs`	`/productId`	30d	~20 RU
2.11 IP Rules	`ip_rules`	`/productId`	None (manual)	~10 RU
2.12 Experiments	`experiments`	`/productId`	None	~50 RU
2.13 Analytics	`analytics_rollups`	`/productId:metric:period`	None	~300 RU (write-heavy during rollup)
2.14 Feedback	`feedback`	`/productId`	None	~50 RU
2.16 Changelog	`changelog`	`/productId`	None	~10 RU
2.20 i18n	`translations`	`/productId:locale`	None	~100 RU (read-heavy, cacheable)
2.23 Retention	`retention_policies`	`/productId`	None	~5 RU

Total new containers: ~19 (across all phases)

Existing containers: 27 (defined in cosmos-init.ts: products, users, settings, devices, notification_prefs, audit_log, feature_flags, invitation_codes, referrals, subscriptions, payments, licenses, plans, usage_daily, api_tokens, tracker_items, comments, votes, themes, waitlist, memory_items, daily_briefs, reflections, brain_insights, telemetry_events, telemetry_error_clusters, telemetry_collection_policies).

Note: the promos module uses the Stripe API directly, so it has no Cosmos container.

Cost impact: Minimal for Serverless tier. Idle containers only consume storage; active containers during job runs add burst RU.

Recommendation: Register all new containers in cosmos-init.ts alongside existing ones. Use TTL liberally for transient data (tokens, deliveries, job runs) to keep storage bounded.
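Cosmos DB expresses a container's default TTL in seconds, so the day-based values in the table need converting. The sketch below shows a few of the new containers as plain definition objects. The object shape mirrors the `@azure/cosmos` container definition fields (`id`, `partitionKey.paths`, `defaultTtl`), but how `cosmos-init.ts` actually registers containers is not shown in this document, so treat the `ContainerSpec` type and the list itself as assumptions.

```typescript
const DAY_SECONDS = 24 * 60 * 60;

// Minimal container description; field names follow the Cosmos container definition.
interface ContainerSpec {
  id: string;
  partitionKey: { paths: string[] };
  defaultTtl?: number; // seconds; omit to keep documents indefinitely
}

// A subset of the new containers from the table above, with TTLs converted to seconds.
const NEW_CONTAINERS: ContainerSpec[] = [
  { id: 'password_reset_tokens', partitionKey: { paths: ['/productId'] }, defaultTtl: DAY_SECONDS },
  { id: 'sessions', partitionKey: { paths: ['/userId'] }, defaultTtl: 90 * DAY_SECONDS },
  { id: 'export_jobs', partitionKey: { paths: ['/productId'] }, defaultTtl: 30 * DAY_SECONDS },
  { id: 'changelog', partitionKey: { paths: ['/productId'] } }, // no TTL
];

// Containers that will grow without bound unless a retention job cleans them up.
function containersWithoutTtl(specs: ContainerSpec[]): string[] {
  return specs.filter((s) => s.defaultTtl === undefined).map((s) => s.id);
}
```

A helper like `containersWithoutTtl` could feed the retention page in §2.23 by listing which containers rely on scheduled cleanup instead of TTL.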

5. New Environment Variables

New components will require additional env vars. All should be added to .env.example files in both repos and documented.

Component	Variable	Example	Required
2.1 Jobs	`JOB_RUNNER_ENABLED`	`true`	No (default: true)
2.1 Jobs	`JOB_TICK_INTERVAL_MS`	`60000`	No (default: 60s)
2.2 Email	`SENDGRID_API_KEY`	`SG.xxx`	Yes (for email delivery)
2.2 Email	`EMAIL_FROM_ADDRESS`	`noreply@lysnrai.com`	Yes
2.2 Email	`EMAIL_FROM_NAME`	`LysnrAI`	No
2.2 Push	`APNS_KEY_ID`	`ABC123`	Yes (for iOS push)
2.2 Push	`APNS_TEAM_ID`	`748N7QPX7J`	Yes
2.2 Push	`APNS_KEY_PATH`	`./certs/AuthKey.p8`	Yes
2.2 Push	`FCM_SERVICE_ACCOUNT_JSON`	`{...}`	Yes (for Android push)
2.5 Auth	`PASSWORD_RESET_URL_BASE`	`https://app.lysnrai.com/reset`	Yes
2.5 Auth	`EMAIL_VERIFY_URL_BASE`	`https://app.lysnrai.com/verify`	Yes
2.10 Maintenance	`MAINTENANCE_MODE`	`off`	No (default: off)
2.10 Maintenance	`MAINTENANCE_BYPASS_IPS`	`10.0.0.1,10.0.0.2`	No
2.3 Webhooks	`WEBHOOK_DELIVERY_TIMEOUT_MS`	`5000`	No (default: 5s)
2.3 Webhooks	`WEBHOOK_MAX_RETRIES`	`3`	No (default: 3)
2.7 Sessions	`SESSION_TTL_DAYS`	`90`	No (default: 90)
2.7 Sessions	`SESSION_CACHE_TTL_MS`	`30000`	No (default: 30s)
2.19 OpenAPI	`SWAGGER_UI_ENABLED`	`true`	No (default: true in dev)

Secret management: SENDGRID_API_KEY, APNS_*, and FCM_* should be added to Azure Key Vault as lysnr-sendgrid-api-key, lysnr-apns-key-id, etc. Update LYSNR_SECRETS in @bytelyst/config to include them.
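Most of the variables above are optional with defaults, which is easy to get subtly wrong when parsing strings. The sketch below reads a handful of them from an environment map, applies the documented defaults, and fails fast on malformed or missing values. The helper names and the returned config shape are assumptions; the real parsing would presumably live in `@bytelyst/config`.

```typescript
type Env = Record<string, string | undefined>;

// Parse a non-negative integer variable, falling back to the documented default.
function intFromEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === '') return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

// Fail fast for variables marked "Yes" in the Required column.
function requireEnv(env: Env, name: string): string {
  const raw = env[name];
  if (raw === undefined || raw === '') throw new Error(`${name} is required`);
  return raw;
}

// Optional settings with the defaults from the table above.
function loadOptionalConfig(env: Env) {
  return {
    jobRunnerEnabled: (env['JOB_RUNNER_ENABLED'] ?? 'true') === 'true',
    jobTickIntervalMs: intFromEnv(env, 'JOB_TICK_INTERVAL_MS', 60_000),
    webhookDeliveryTimeoutMs: intFromEnv(env, 'WEBHOOK_DELIVERY_TIMEOUT_MS', 5_000),
    webhookMaxRetries: intFromEnv(env, 'WEBHOOK_MAX_RETRIES', 3),
    sessionTtlDays: intFromEnv(env, 'SESSION_TTL_DAYS', 90),
    sessionCacheTtlMs: intFromEnv(env, 'SESSION_CACHE_TTL_MS', 30_000),
    maintenanceBypassIps: (env['MAINTENANCE_BYPASS_IPS'] ?? '')
      .split(',')
      .map((ip) => ip.trim())
      .filter((ip) => ip !== ''),
  };
}
```

Taking the environment as a parameter instead of reading `process.env` directly keeps the function testable; production code would call `loadOptionalConfig(process.env)`.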

6. Quick Reference — Where Things Live

Component	Repo	Path
Platform-service modules	`learning_ai_common_plat`	`services/platform-service/src/modules/`
Shared packages	`learning_ai_common_plat`	`packages/`
Admin dashboard	`learning_voice_ai_agent`	`admin-dashboard-web/`
User dashboard	`learning_voice_ai_agent`	`user-dashboard-web/`
Tracker dashboard	`learning_voice_ai_agent`	`tracker-dashboard-web/`
Docker Compose	both repos	`docker-compose.yml`
Monitoring	`learning_ai_common_plat`	`services/monitoring/`
Design tokens	`learning_ai_common_plat`	`packages/design-tokens/`
MindLyst native app	`learning_multimodal_memory_agents`	`mindlyst-native/` (KMP + SwiftUI + Compose + Next.js)
MindLyst web	`learning_multimodal_memory_agents`	`mindlyst-native/web/`
Existing webhooks	`learning_ai_common_plat`	`services/platform-service/src/lib/webhooks.ts`
Cosmos container defs	`learning_ai_common_plat`	`services/platform-service/src/lib/cosmos-init.ts`
Telemetry design doc	`learning_ai_common_plat`	`docs/WINDSURF/CLIENT_TELEMETRY_DESIGN.md`
Telemetry roadmap	`learning_ai_common_plat`	`docs/WINDSURF/TELEMETRY_ROADMAP.md`
This document	`learning_ai_common_plat`	`docs/WINDSURF/PLATFORM_COMPONENTS_ROADMAP.md`

Appendix A: Risks & Open Questions

#	Topic	Risk / Question	Mitigation
1	Leader election for jobs	In-process tick loop with Cosmos lease — what happens during deploys? Two instances may briefly both hold leases.	Cosmos lease has a built-in TTL. Use 30s lease with 10s renewal. During deploy overlap, the old instance's lease expires before the new one acquires. Jobs must be idempotent.
2	Email deliverability	SendGrid requires domain verification (SPF/DKIM/DMARC). Without it, emails land in spam.	Set up `lysnrai.com` domain authentication in SendGrid before shipping §2.2. Budget 1–2 days for DNS propagation.
3	Session validation latency	Checking Cosmos on every request for session revocation adds ~5–10ms per request.	In-memory cache with 30s TTL (§2.7). Revocation is eventually consistent — acceptable trade-off for most apps. Document the 30s window.
4	Cosmos container proliferation	27 existing + ~19 new ≈ 46 containers. Serverless tier has no per-container cost, but management complexity grows.	Group related containers by module. Document all containers in `cosmos-init.ts`. Consider container-per-module naming convention.
5	Event bus ordering guarantees	In-memory `EventEmitter` calls listeners in registration order but does not await async handlers, so their completion order is not guaranteed. If audit must record before webhook fires, ordering matters.	Phase 1: Document that async handlers run concurrently with no completion-order guarantee. If ordering is needed, use handler priority weights or sequential mode.
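6	Push notification certificates	Certificate-based APNs auth requires yearly certificate renewal. If the certificate expires, all iOS push silently stops. (The `.p8` auth key configured in §5 does not expire, but it can be revoked.)	Add `apns-cert-expiry-check` to scheduled jobs (§2.1). Alert admin 30 days before expiry.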
7	Webhook abuse	External subscribers could register slow endpoints that back up the delivery queue.	Per-subscription timeout (5s default), circuit breaker after 10 consecutive failures, auto-disable.
8	Migration rollback	Cosmos is schemaless — some migrations (e.g., partition key changes) are irreversible.	Mark migrations as `reversible: true/false`. Require manual approval for irreversible migrations. Always back up before running.
9	MindLyst parity	MindLyst web uses Cosmos directly (in-memory fallback). Shared components (email, sessions, webhooks) must work for MindLyst too, not just LysnrAI.	All new modules use `productId` for multi-product isolation. MindLyst can consume the same platform-service APIs.
10	Priority conflicts	Sprint plan assumes available engineering bandwidth. If telemetry or mobile work takes priority, these sprints slip.	Treat sprint assignments as relative ordering, not calendar commitments. Re-evaluate after each sprint.
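The webhook-abuse mitigation in row 7 (auto-disable after 10 consecutive failures) is simple enough to sketch. The version below tracks per-subscription health as a pure function of delivery outcomes. The names and the choice to keep a disabled subscription disabled until an admin re-enables it are assumptions.

```typescript
const FAILURE_THRESHOLD = 10; // consecutive failures before auto-disable

interface SubscriptionHealth {
  consecutiveFailures: number;
  disabled: boolean;
}

// Update health after one delivery attempt. A success resets the counter;
// once disabled, the subscription stays disabled until re-enabled explicitly.
function recordDelivery(health: SubscriptionHealth, ok: boolean): SubscriptionHealth {
  if (health.disabled) return health;
  if (ok) return { consecutiveFailures: 0, disabled: false };
  const consecutiveFailures = health.consecutiveFailures + 1;
  return { consecutiveFailures, disabled: consecutiveFailures >= FAILURE_THRESHOLD };
}

// Hypothetical admin action: clear the breaker and start counting again.
function reenable(): SubscriptionHealth {
  return { consecutiveFailures: 0, disabled: false };
}
```

Because only consecutive failures count, an endpoint that fails intermittently keeps receiving events, while one that is persistently down or slow stops consuming delivery capacity.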

Appendix B: Component Dependency Graph

                ┌─────────────────────┐
                │   Event Bus (2.4)   │
                └──────────┬──────────┘
                           │ emits to subscribers
      ┌─────────────┬──────┴──────┬─────────────┐
      │             │             │             │
      ▼             ▼             ▼             ▼
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ Email/Push│ │ Webhook   │ │ Audit Log │ │ Analytics │
│ (2.2)     │ │ (2.3)     │ │ (existing)│ │ (2.13)    │
└─────┬─────┘ └───────────┘ └───────────┘ └───────────┘
      │
      │ sends
      ▼
┌───────────┐
│ Password  │
│ Reset(2.5)│
└───────────┘

┌───────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ Scheduled     │──▶│ Analytics       │   │ Blob Storage    │
│ Jobs (2.1)    │   │ Rollups (2.13)  │   │ (existing)      │
└───────┬───────┘   └─────────────────┘   └────────┬────────┘
        │                                          │
        │ triggers on schedule                     ▲ writes exports
        ▼                                          │
┌───────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ Trial Expiry  │   │ Usage Reset     │   │ Data Export     │
│ (2.1 job)     │   │ (2.1 job)       │   │ (2.9)           │
└───────────────┘   └─────────────────┘   └─────────────────┘

┌───────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ Billing       │──▶│ Email/Push      │   │ Retention       │
│ Dunning(2.25) │   │ Delivery (2.2)  │   │ Cleanup (2.23)  │
└───────────────┘   └─────────────────┘   └─────────────────┘

Appendix C: Review Findings

Systematic review performed 2026-02-17. All issues below have been fixed inline.

#	Severity	Section	Finding	Fix
1	Bug	§1.3	Test count stale: said "158+ tests" — actual count is 621 (verified via `grep -c 'it(' *.test.ts`).	Updated to 621.
2	Bug	§1.1	Endpoint column inconsistent: some modules said "CRUD" (vague, could be 4–8 routes), others had exact counts.	Replaced all "CRUD" with actual route counts.
3	Bug	§2.5	Said "console-logged URLs for dev/testing" — violates project rule: never `console.log` in production code.	Changed to `req.log.info`.
4	Bug	§2.12	`ExperimentDoc.targetingRules: {}` — meaningless empty object type.	Changed to `FlagTargetingRules` (reuse from flags module).
5	Bug	§2.3	Webhook event `user.deleted` source said `auth.delete` — no such endpoint name. Actual route is `DELETE /auth/users/:id` (admin action).	Fixed source column.
6	Bug	§4	`email_verifications` container (from §2.5) missing from Cosmos table. Only `password_reset_tokens` was listed.	Added `email_verifications` to §2.5 row.
7	Bug	§4	Existing container count said "~25+" — actual is 27 (counted from `cosmos-init.ts`; `promos` uses Stripe API directly, no Cosmos container).	Updated to 27 with full container list.
8	Bug	§4	Total new containers said "~17" — after adding `email_verifications` and `ip_rules`, count is 19.	Updated.
9	Gap	§2.2	No clarity on email template storage strategy. `renderer.ts` mentioned but not whether templates are Cosmos-stored or file-based.	Clarified: `repository.ts` now references `delivery_log + email_templates` containers.
10	Gap	§2.4	No migration strategy from existing `lib/webhooks.ts` to new event bus pattern.	Added "Migration from existing `lib/webhooks.ts`" subsection with 3-phase plan.
11	Gap	§2.10	Maintenance mode proposed extending `settings` module but didn't clarify storage location. Missing from §4 Cosmos table.	Added: stored as single document per product in existing `settings` container (no new container needed).
12	Gap	§2.11	IP rules need persistence but no container was mentioned. Missing from §4 table.	Added `ip_rules` container (pk: `/productId`) to both §2.11 and §4 table.
13	Gap	§2.9	Data Export didn't mention blob module dependency (exports written to blob storage).	Added explicit dependency note on `blob` module and `jobs` module for cleanup.
14	Gap	§5	Missing env vars for webhooks (timeout, retries) and sessions (TTL, cache TTL).	Added 4 new env vars: `WEBHOOK_DELIVERY_TIMEOUT_MS`, `WEBHOOK_MAX_RETRIES`, `SESSION_TTL_DAYS`, `SESSION_CACHE_TTL_MS`.
15	Gap	§6	Quick Reference missing MindLyst repo (`learning_multimodal_memory_agents`). Doc scope says "ByteLyst platform" which includes MindLyst.	Added MindLyst native app and web entries. Also added `cosmos-init.ts` path.
16	Gap	Appendix	Dependency graph incomplete: missing Jobs → Data Export connection, missing Blob → Data Export dependency, downstream jobs not labeled with section numbers.	Rewrote graph with all connections and section labels.
17	Gap	Overall	No "Risks & Open Questions" section — design docs should call out unknowns.	Added Appendix A with 10 risk items and mitigations.
18	Gap	TOC	Table of Contents didn't include Appendix sections.	Added Appendix A, B, C to TOC.
19	Gap	§2.5	Password reset cross-referenced "§2.6" for sessions but sessions was renumbered to §2.7 in previous edit pass.	Fixed to §2.7 (caught in prior pass).
20	Gap	§1.5	Infrastructure table was missing Swagger/OpenAPI (partially wired) and Prometheus metrics (partially enabled).	Added in prior pass — verified still present.

This document is a living brainstorm. Items will be promoted to dedicated design docs (like CLIENT_TELEMETRY_DESIGN.md) as they move into implementation.
