Platform Components Roadmap — What's Built, What's Missing, What's Next
Status: Living document — brainstorm + gap analysis
Last updated: 2026-02-17
Scope: All infrastructure components relevant to admin, DevOps, and product operations across the ByteLyst platform.
Repos: learning_ai_common_plat (platform-service, packages) · learning_voice_ai_agent (dashboards, clients)
Table of Contents
- Current Inventory
- Gap Analysis — Missing Components
- Implementation Priority Matrix
- New Cosmos Containers & Cost Impact
- New Environment Variables
- Quick Reference — Where Things Live
- Appendix A: Risks & Open Questions
- Appendix B: Component Dependency Graph
- Appendix C: Review Findings
1. Current Inventory
1.1 Platform-Service Modules (25 modules)
| Category | Module | Endpoints | Description |
|---|---|---|---|
| Identity | auth | 11 routes | Login, register, refresh, SSO, profile, admin user CRUD |
| Identity | tokens | 5 routes | API token management (CRUD + validate) |
| Identity | licenses | 6 routes | License key generation, activation, device binding, validate |
| Billing | subscriptions | 5 routes | Plan management, trial tracking, period management |
| Billing | stripe | 2 routes | Inbound Stripe webhook + portal session |
| Billing | plans | 4 routes | Plan definitions (free, pro, enterprise) |
| Billing | usage | 4 routes | Usage tracking and quota enforcement |
| Billing | promos | 5 routes | Promo code creation, validation, redemption |
| Growth | invitations | 5 routes | Invitation code generation, redemption, tracking |
| Growth | referrals | 5 routes | Referral link tracking, status transitions |
| Growth | waitlist | 12 routes | Pre-launch signups, position tracking, admin batch invite, CSV export |
| Growth | public | 5 routes | Public roadmap, community voting, feature submissions |
| Content | items | 5 routes | Tracker items (bugs, features, tasks) |
| Content | comments | 4 routes | Threaded comments on items |
| Content | votes | 3 routes | User votes on items and comments |
| Content | memory | 5 routes | Memory items — create, reassign, patch, delete |
| Ops | audit | Query | Audit log recording and admin queries |
| Ops | flags | 5 routes | Feature flags with FNV-1a deterministic rollout |
| Ops | telemetry | 9 routes | Client event ingestion, error clustering, collection policies, GDPR erasure |
| Ops | notifications | 5 routes | Device registration, notification preferences |
| Ops | settings | 6 routes | User/device settings, kill switch |
| Ops | ratelimit | 4 routes | Rate limit checking, config management |
| Ops | themes | 7 routes | Platform theming (iOS, Android, Desktop) |
| Ops | blob | 5 routes | Azure Blob Storage SAS tokens, list, delete, info |
| Registry | products | 4 routes | Multi-product registry with full lifecycle (draft → pre_launch → beta → active → sunset → disabled) |
1.2 Shared Packages (13 packages)
| Package | Purpose |
|---|---|
| @bytelyst/errors | Typed HTTP errors (400–429) |
| @bytelyst/cosmos | Cosmos DB client singleton + container registry |
| @bytelyst/config | Zod env loader, product identity, AKV resolver |
| @bytelyst/auth | JWT utilities, auth middleware, password hashing |
| @bytelyst/api-client | Fetch wrapper with auth token injection |
| @bytelyst/fastify-core | createServiceApp() factory + startService() |
| @bytelyst/react-auth | React auth context factory |
| @bytelyst/logger | Structured logging (pino-based) |
| @bytelyst/testing | Shared test mocks, Fastify inject helpers |
| @bytelyst/blob | Azure Blob Storage client + SAS helpers |
| @bytelyst/extraction | Extraction client + shared types |
| @bytelyst/monitoring | Health-check utilities |
| @bytelyst/design-tokens | Cross-platform token generator (JSON → CSS/TS/Kotlin/Swift) |
1.3 Services
| Service | Port | Description |
|---|---|---|
| platform-service | 4003 | Consolidated Fastify service (25 modules, 621 tests) |
| extraction-service | 4005 | LangExtract text extraction + Python sidecar |
| monitoring | 4004 | Health-check aggregator (all services) |
1.4 Dashboards
| Dashboard | Port | Pages |
|---|---|---|
| admin-dashboard-web | 3001 | ~25 pages — users, billing, flags, ops, telemetry, secrets, etc. |
| user-dashboard-web | 3002 | User portal — subscription, usage, settings |
| tracker-dashboard-web | 3003 | Public roadmap, issue tracker, community voting |
1.5 Infrastructure Already In Place
| Component | Status | Notes |
|---|---|---|
| Health checks | ✅ | Per-service /health + aggregated monitoring script |
| Structured logging | ✅ | Pino (Fastify) + structlog (Python) |
| Log aggregation | ✅ | Loki + Grafana (Docker Compose) |
| Reverse proxy | ✅ | Traefik (Docker Compose) |
| Secret management | ✅ | Azure Key Vault + admin CRUD UI at /ops/secrets |
| Feature flags | ✅ | FNV-1a hash, percentage rollout, admin UI |
| Client telemetry | ✅ | All platforms instrumented, admin Client Logs page |
| Rate limiting | ✅ | In-memory sliding window + configurable rules per product |
| Outbound webhooks | ⚠️ Partial | Fire-and-forget POST for 3 events (lib/webhooks.ts); no subscription model, no retry, no HMAC signing |
| Kill switch | ✅ | Per-product, checked by all clients via /settings/kill-switch |
| Audit logging | ✅ | Records admin actions, queryable from admin dashboard |
| Blob storage | ✅ | 6 containers (audio, transcripts, attachments, avatars, releases, backups), SAS tokens, admin endpoints |
| Swagger / OpenAPI | ⚠️ Partial | createServiceApp() passes swagger config; Fastify plugin wired but Zod schemas not fully connected to route definitions via type provider |
| Prometheus metrics | ⚠️ Partial | metrics: true in createServiceApp() — basic request metrics exposed; no custom business metrics, no Grafana dashboards for them |
| Product registry | ✅ | Multi-product with full status lifecycle (draft → pre_launch → beta → active → sunset → disabled), prelaunch config, custom fields |
| Admin doc browser | ✅ | /docs page with markdown viewer, search, and AI chat — browses repo documentation |
2. Gap Analysis — Missing Components
P0 — Foundational
These are blocking features that nearly every production app needs. Without them, critical operational workflows are manual or impossible.
2.1 Scheduled Jobs / Background Task Runner
Why: No way to run recurring work today. Trial expirations, subscription renewals, usage quota resets, stale data cleanup, digest emails, and report generation all require a scheduler.
Current state: Zero. All logic is request-driven (HTTP request → response).
Proposed design:
platform-service/src/modules/jobs/
├── types.ts — JobDefinition, JobRun, JobSchedule schemas
├── registry.ts — Job registry (register named jobs with cron expressions)
├── runner.ts — Tick loop: evaluate cron, run due jobs, record outcomes
├── repository.ts — Cosmos: job_definitions, job_runs containers
└── routes.ts — Admin: list jobs, trigger manually, view run history, pause/resume
Built-in jobs to ship on day 1:
| Job | Schedule | Description |
|---|---|---|
| trial-expiration-check | Every hour | Find subscriptions with status=trialing past currentPeriodEnd, transition to expired or active |
| usage-quota-reset | Daily at midnight UTC | Reset daily/monthly counters in usage_daily container |
| stale-session-cleanup | Every 6 hours | Remove expired refresh tokens and inactive sessions |
| telemetry-ttl-sweep | Daily at 3am UTC | Delete telemetry events past retention TTL (Cosmos TTL is best-effort) |
| waitlist-reminder | Weekly | Identify stale waitlist entries, mark for follow-up |
| license-expiry-check | Daily | Warn users whose licenses expire within 7 days |
Options for the runner:
- In-process tick loop (simplest): setInterval in platform-service, with leader election via Cosmos lease
- Azure Functions timer triggers (serverless): Lower cost, built-in cron, but adds deployment complexity
- BullMQ + Redis (heavy): Best for high-throughput, but adds a Redis dependency
Recommendation: Start with in-process tick loop + Cosmos lease for leader election (avoids Redis). Migrate to Azure Functions if job volume grows.
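The recommended tick loop can be sketched roughly as follows. This is a minimal illustration, not the planned implementation: Job, Lease, acquireLease, and tick are hypothetical names, a fixed interval stands in for a cron expression, and the lease is a plain in-memory object where production code would use a conditional (ETag-guarded) write to a Cosmos document.

```typescript
// Hypothetical shapes for illustration; none of these exist in the codebase yet.
interface Job {
  name: string;
  intervalMs: number;       // stand-in for a cron expression
  enabled: boolean;
  lastRunAt: number | null; // epoch ms
  run: () => void;
}

interface Lease {
  holder: string | null;
  expiresAt: number; // epoch ms
}

// Take or renew the lease. Succeeds if the lease is free, expired,
// or already held by this instance.
function acquireLease(lease: Lease, instanceId: string, now: number, ttlMs: number): boolean {
  const free = lease.holder === null || lease.expiresAt <= now;
  if (free || lease.holder === instanceId) {
    lease.holder = instanceId;
    lease.expiresAt = now + ttlMs;
    return true;
  }
  return false;
}

// One tick: only the lease holder runs due jobs. Returns the names of jobs that ran.
function tick(jobs: Job[], lease: Lease, instanceId: string, now: number): string[] {
  if (!acquireLease(lease, instanceId, now, 30000)) return [];
  const ran: string[] = [];
  for (const job of jobs) {
    const due = job.lastRunAt === null || now - job.lastRunAt >= job.intervalMs;
    if (!job.enabled || !due) continue;
    try {
      job.run();
    } catch (err) {
      // A real runner would record the failure in job_runs here.
    }
    job.lastRunAt = now;
    ran.push(job.name);
  }
  return ran;
}
```

In the service, something like setInterval(() => tick(jobs, lease, instanceId, Date.now()), 10000) would drive the loop. Because the lease has a TTL, a crashed leader is replaced once its lease expires.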
Admin UI:
- /ops/jobs page: list all registered jobs, last run status, next scheduled run
- Manual trigger button per job
- Run history table with duration, outcome, error details
- Pause/resume toggle per job
Cosmos containers:
- job_definitions (pk: /productId) — name, cron, enabled, lastRunAt, nextRunAt
- job_runs (pk: /productId:jobName) — runId, startedAt, completedAt, status, error, metrics
2.2 Transactional Email & Push Delivery
Why: The notifications module manages device registration and preferences, but has no delivery mechanism. Notifications are database records with no way to reach users.
Current state: Device registration + preference management only. No email, no push, no SMS.
Proposed design:
platform-service/src/modules/delivery/
├── types.ts — DeliveryRequest, DeliveryLog, ChannelConfig schemas
├── channels/
│ ├── email.ts — SendGrid/Postmark adapter
│ ├── push-apns.ts — Apple Push Notification Service
│ ├── push-fcm.ts — Firebase Cloud Messaging
│ └── sms.ts — Twilio/Azure Communication Services (future)
├── renderer.ts — Template rendering (Handlebars for email bodies)
├── repository.ts — delivery_log + email_templates containers
├── dispatcher.ts — Route delivery request to correct channel(s) based on prefs
└── routes.ts — Admin: send test, view delivery log, manage templates
Email templates to ship on day 1:
| Template | Trigger | Description |
|---|---|---|
| welcome | auth.register | Welcome email with getting-started guide |
| trial-expiring | jobs.trial-expiration-check (7d warning) | "Your trial ends in 7 days" |
| trial-expired | jobs.trial-expiration-check | "Your trial has ended — upgrade to continue" |
| password-reset | Future: /auth/forgot-password | One-time reset link |
| invitation | invitations.create | "You've been invited to join" |
| waitlist-accepted | waitlist.invite | "You're in! Here's your access" |
| payment-failed | stripe.invoice.payment_failed | "We couldn't charge your card" |
| license-expiring | jobs.license-expiry-check | "Your license expires in 7 days" |
Push notification types:
| Type | Channel | Description |
|---|---|---|
| dictation_reminder | APNs + FCM | "Haven't dictated today — keep your streak!" |
| feature_announcement | APNs + FCM | Admin-triggered announcement |
| subscription_change | APNs + FCM | Plan upgraded/downgraded/expired |
Cosmos container:
- delivery_log (pk: /productId:channel:yyyyMM) — id, userId, channel, template, status (sent/failed/bounced), sentAt, error
Admin UI:
- /ops/delivery page: delivery log with filters (channel, status, template, date range)
- Template management: list, preview, edit (future: visual editor)
- "Send test" button for each template
- Delivery stats: sent/failed/bounced/opened (with SendGrid webhook integration)
2.3 Outbound Webhook Subscriptions
Why: Current webhooks.ts is fire-and-forget to env-var URLs with no retry, no signing, no subscriber management. External integrations (Zapier, Slack, custom) need a proper webhook subscription system.
Current state: 3 hardcoded webhook dispatchers (invitation redeemed, referral status changed, waitlist joined). No retry. No HMAC signing. No subscription management.
Proposed design:
platform-service/src/modules/webhooks/
├── types.ts — WebhookSubscription, WebhookDelivery, WebhookEvent schemas
├── repository.ts — Cosmos: webhook_subscriptions, webhook_deliveries containers
├── dispatcher.ts — Match event → subscriptions, queue delivery, HMAC-SHA256 sign
├── delivery.ts — HTTP POST with exponential backoff retry (3 attempts)
└── routes.ts — Admin CRUD for subscriptions + delivery log
Event catalog (subscribe to any combination):
| Event | Payload | Source |
|---|---|---|
| user.created | { userId, email, plan } | auth.register, auth.sso |
| user.deleted | { userId } | Admin: DELETE /auth/users/:id |
| subscription.created | { subscriptionId, userId, plan, status } | Registration hook |
| subscription.changed | { subscriptionId, oldPlan, newPlan, status } | Stripe webhook |
| subscription.canceled | { subscriptionId, userId, reason } | User action / Stripe |
| payment.succeeded | { invoiceId, amount, userId } | Stripe webhook |
| payment.failed | { invoiceId, amount, userId, retryCount } | Stripe webhook |
| invitation.redeemed | { invitationId, userId } | Invitation module |
| referral.completed | { referralId, referrerId, referredId } | Referral module |
| waitlist.joined | { email, position } | Waitlist module |
| flag.toggled | { flagId, enabled, percentage } | Flags module |
| license.activated | { licenseId, userId, deviceId } | License module |
| license.expired | { licenseId, userId } | Jobs: license-expiry-check |
Security:
- Every delivery signed with X-Webhook-Signature: sha256=<HMAC> using per-subscription secret
- Subscription secret generated at creation time, displayed once, rotatable
- Replay protection: X-Webhook-Timestamp header, reject if > 5 min old
Retry policy:
- 3 attempts with exponential backoff: 10s → 60s → 300s
- After 3 failures: mark subscription as failing, admin notification
- After 10 consecutive failures: auto-disable subscription
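The retry policy and the replay window from the Security list could be encoded as small pure functions. This is a sketch under assumptions: the names are hypothetical, the delays are read as the wait before each retry, and both thresholds are treated as counts of consecutive failed deliveries.

```typescript
// Delays before each retry, taken from the schedule above (seconds).
const RETRY_DELAYS_S = [10, 60, 300];
const FAILING_THRESHOLD = 3;  // consecutive failures before "failing"
const DISABLE_THRESHOLD = 10; // consecutive failures before auto-disable

type SubscriptionHealth = "healthy" | "failing" | "disabled";

// Delay before retry number `retry` (1-based), or null once retries are exhausted.
function retryDelaySeconds(retry: number): number | null {
  const delay = RETRY_DELAYS_S[retry - 1];
  return delay === undefined ? null : delay;
}

// Map a consecutive-failure count to a subscription state.
function healthFor(consecutiveFailures: number): SubscriptionHealth {
  if (consecutiveFailures >= DISABLE_THRESHOLD) return "disabled";
  if (consecutiveFailures >= FAILING_THRESHOLD) return "failing";
  return "healthy";
}

// Receiver-side replay check: accept timestamps at most 5 minutes old.
function isFresh(timestampMs: number, nowMs: number): boolean {
  return nowMs - timestampMs <= 5 * 60 * 1000;
}
```

Keeping these as pure functions makes the policy easy to unit test separately from the HTTP delivery code.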
Admin UI:
- /ops/webhooks page: list subscriptions, create/edit/delete, test delivery
- Delivery log: status (success/failed/retrying), response code, duration, payload preview
- Per-subscription health indicator (green/yellow/red based on recent success rate)
Cosmos containers:
- webhook_subscriptions (pk: /productId) — id, url, secret, events[], enabled, failureCount, lastDeliveryAt
- webhook_deliveries (pk: /subscriptionId:yyyyMM) — id, event, status, attempts[], responseCode, duration
2.4 Async Event Bus / Internal Pub-Sub
Why: Today everything is synchronous request-response. As the platform grows, many operations should be fire-and-forget: audit log writes, webhook delivery, email sending, telemetry cluster updates, usage tracking. Without decoupling, any slow downstream operation blocks the API response.
Current state: Some fire-and-forget with unhandled promise rejections (e.g., telemetry cluster updates). No formal event bus.
Proposed design:
packages/events/
├── src/
│ ├── index.ts — EventBus class, typed event definitions
│ ├── types.ts — PlatformEvent union type, EventHandler interface
│ └── memory.ts — In-memory implementation (default)
Event flow:
API route handler
→ bus.emit('user.created', { userId, email, plan })
→ [handler] audit.record()
→ [handler] webhook.dispatch()
→ [handler] email.sendWelcome()
→ [handler] analytics.track()
Implementation options:
- Phase 1: In-memory EventEmitter wrapper with typed events (zero dependencies)
- Phase 2: Azure Service Bus adapter for cross-service events
- Phase 3: Azure Event Grid for external consumer webhooks
Typed event definitions (Zod):
const PlatformEvents = {
'user.created': z.object({ userId: z.string(), email: z.string(), plan: z.string() }),
'user.deleted': z.object({ userId: z.string() }),
'subscription.changed': z.object({
subscriptionId: z.string(),
oldPlan: z.string(),
newPlan: z.string(),
}),
'payment.failed': z.object({ invoiceId: z.string(), userId: z.string() }),
// ... all events from webhook catalog
} as const;
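A Phase 1 bus might look roughly like the sketch below. It is illustrative only: plain TypeScript types stand in for the Zod schemas to keep it dependency-free, handlers are synchronous, and the class shape (on/emit) is an assumption rather than a settled API.

```typescript
// Payload types mirroring two of the events defined above.
interface PlatformEventMap {
  "user.created": { userId: string; email: string; plan: string };
  "user.deleted": { userId: string };
}

type Handler<T> = (payload: T) => void;

class EventBus {
  private handlers: { [name: string]: Handler<any>[] } = {};

  // Register a handler; the payload type is inferred from the event name.
  on<K extends keyof PlatformEventMap>(event: K, handler: Handler<PlatformEventMap[K]>): void {
    (this.handlers[event] = this.handlers[event] || []).push(handler);
  }

  // Call every handler in registration order. A throwing handler does not stop
  // the others; collected errors are returned so the caller can log them.
  emit<K extends keyof PlatformEventMap>(event: K, payload: PlatformEventMap[K]): unknown[] {
    const errors: unknown[] = [];
    for (const handler of this.handlers[event] || []) {
      try {
        handler(payload);
      } catch (err) {
        errors.push(err);
      }
    }
    return errors;
  }
}
```

Isolating handler failures matters here because audit, webhook, and email subscribers share one emit call. A production version would likely run handlers asynchronously and catch rejections, which would also address the unhandled promise rejections noted under current state.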
Migration from existing lib/webhooks.ts:
- Existing dispatchInvitationRedeemed(), dispatchReferralStatusChanged(), dispatchWaitlistJoined() become event bus subscribers
- Phase 1: Register existing webhooks.ts functions as handlers on the bus
- Phase 2: Replace inline dispatch calls in routes with bus.emit()
- Phase 3: Remove lib/webhooks.ts once all callers migrated
Benefits:
- Audit logging becomes a subscriber, not inline code
- Webhook delivery becomes a subscriber, not inline code
- Email sending becomes a subscriber, not inline code
- New features can subscribe to events without modifying existing modules
2.5 Missing Auth Flows — Password Reset & Email Verification
Why: The auth module has login, register, SSO, and refresh — but no password reset and no email verification. These are table-stakes for any production auth system.
Current state: If a user forgets their password, there is no recovery path. Registration accepts any email without verification.
Proposed additions to auth module:
Password reset flow:
- POST /auth/forgot-password — accepts { email, productId }, generates a time-limited reset token (UUID), stores hash in password_reset_tokens container, sends email with reset link (via delivery module §2.2)
- POST /auth/reset-password — accepts { token, newPassword }, validates token, updates passwordHash, invalidates token, optionally revokes all sessions (§2.7)
Email verification flow:
- On register: generate verification token, store in email_verifications container, send email
- POST /auth/verify-email — accepts { token }, marks user email as verified
- POST /auth/resend-verification — rate-limited, re-sends verification email
- Add emailVerified: boolean field to UserDoc
Reset token document:
interface PasswordResetToken {
id: string; // UUID
productId: string;
userId: string;
tokenHash: string; // SHA-256 hash of the token (raw token sent via email)
expiresAt: string; // 1 hour from creation
usedAt?: string;
createdAt: string;
}
Security considerations:
- Store hash of token, not raw token (same pattern as API tokens)
- Tokens expire in 1 hour
- Rate limit: 3 reset requests per email per hour
- After successful reset, invalidate all existing sessions
- Log all reset attempts to audit
Cosmos container:
- password_reset_tokens (pk: /productId) — short-lived, TTL 24h auto-expiry
Dependency: Requires email delivery (§2.2) for sending reset links and verification emails. The endpoints can ship first, with the generated URLs written to the request logger (req.log.info, never console.log) for dev/testing.
2.6 Public Status Page
Why: Users and admins need a single place to check if services are operational. The health-check script exists but has no user-facing output.
Current state: monitoring/health-check.ts polls services and prints to stdout. No persistent status, no incident history, no public URL.
Proposed design:
Option A — Self-hosted (minimal):
platform-service/src/modules/status/
├── types.ts — ServiceStatus, Incident, MaintenanceWindow schemas
├── repository.ts — Cosmos: service_status, incidents containers
├── poller.ts — Periodic health poll (reuses @bytelyst/monitoring)
└── routes.ts — Public: GET /public/status, GET /public/status/history
Option B — External (Instatus, Statuspage, or Upptime):
- Upptime (GitHub-based, free, open-source) — runs as a GitHub Action, publishes to GitHub Pages
- Better for public credibility (hosted on a separate domain)
Recommendation: Option A for internal/admin use, Option B for public-facing.
Status page data model:
| Field | Type | Description |
|---|---|---|
| services | array | Current status per service (operational/degraded/down) |
| incidents | array | Active and past incidents with timeline |
| maintenanceWindows | array | Scheduled maintenance with start/end times |
| overallStatus | enum | operational / degraded / major_outage |
| lastCheckedAt | ISO string | When the poller last ran |
Admin UI:
- /ops/status page (or extend existing Mission Control /ops): service health cards with history sparklines
- Incident management: create/update/resolve incidents with public-facing messages
- Maintenance scheduling: create windows with auto-banners
P1 — Operational Maturity
These components improve reliability, debuggability, and operational efficiency. Not launch-blocking, but critical for a team running production services.
2.7 Session Management & Active Devices
Why: Licenses track deviceIds but there's no concept of active sessions. Users can't see where they're logged in. Admins can't force-revoke a compromised session. "Sign out all devices" is impossible.
Current state: JWT tokens with expiry. No session tracking. No revocation list. Refresh tokens are stateless.
Proposed design:
platform-service/src/modules/sessions/
├── types.ts — SessionDoc, CreateSessionInput schemas
├── repository.ts — Cosmos: sessions container (pk: /userId)
├── middleware.ts — Session validation (check revocation on each request)
└── routes.ts — User: list my sessions, revoke one, revoke all
— Admin: list user sessions, force-revoke
Session document:
interface SessionDoc {
id: string; // session ID (embedded in JWT)
productId: string;
userId: string;
deviceId?: string; // linked to license device
platform: string; // ios, android, desktop, web
ipAddress: string;
userAgent: string;
lastActiveAt: string;
createdAt: string;
revokedAt?: string;
expiresAt: string;
}
Endpoints:
- GET /sessions — list my active sessions
- DELETE /sessions/:id — revoke specific session
- DELETE /sessions — revoke all sessions (sign out everywhere)
- GET /sessions/user/:userId — admin: list user's sessions
- DELETE /sessions/user/:userId — admin: force-revoke all
Integration: Refresh token endpoint creates a session. Auth middleware checks session isn't revoked (Cosmos point-read by session ID, cached in-memory with short TTL).
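The short-TTL cache in front of the revocation check could be sketched as below. SessionRevocationCache and loadRevoked are hypothetical names, and the lookup is synchronous to keep the example small, whereas a real Cosmos point-read would be async.

```typescript
interface CachedSession {
  revoked: boolean;
  cachedAt: number; // epoch ms
}

// Small TTL cache in front of the (hypothetical) session lookup.
class SessionRevocationCache {
  private entries: { [sessionId: string]: CachedSession } = {};

  constructor(
    private ttlMs: number,
    private loadRevoked: (sessionId: string) => boolean, // stand-in for the DB read
  ) {}

  // Serve from cache while the entry is younger than the TTL, otherwise reload.
  isRevoked(sessionId: string, now: number): boolean {
    const hit = this.entries[sessionId];
    if (hit && now - hit.cachedAt < this.ttlMs) return hit.revoked;
    const revoked = this.loadRevoked(sessionId);
    this.entries[sessionId] = { revoked, cachedAt: now };
    return revoked;
  }

  // Call when a session is revoked so this instance reflects it immediately.
  invalidate(sessionId: string): void {
    delete this.entries[sessionId];
  }
}
```

The TTL is the trade-off to tune: it bounds how long a revoked session can remain usable on instances that did not handle the revoke call, in exchange for fewer database reads.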
2.8 Database Migration & Schema Evolution Tracker
Why: Cosmos DB is schemaless, but breaking changes still happen: new required fields, partition key changes, index policy updates, container renames. Without tracking, deployments are error-prone and rollbacks are impossible.
Current state: No migration tracking. Schema changes are applied ad-hoc.
Proposed design:
platform-service/src/migrations/
├── runner.ts — Run pending migrations on startup (idempotent)
├── registry.ts — List of migration files, ordered by version
└── migrations/
├── 001_add_productId_to_legacy_users.ts
├── 002_create_telemetry_containers.ts
└── ...
Migration document (in migrations container):
interface MigrationDoc {
id: string; // "001_add_productId_to_legacy_users"
productId: string; // "platform"
version: number;
description: string;
appliedAt: string;
durationMs: number;
status: 'applied' | 'failed' | 'rolled_back';
error?: string;
}
Behavior:
- On service startup, runner checks migrations container for applied versions
- Runs any unapplied migrations in order
- Each migration is idempotent (safe to re-run)
- Failed migrations are recorded but don't block startup (logged as warnings)
- Admin UI: /ops/migrations page showing applied/pending/failed
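The startup behavior above can be sketched as a single function. The names are hypothetical and an array stands in for the migrations container. One design choice here is an assumption not stated above: the runner stops at the first failure so later migrations never run on top of a half-applied one, while still returning normally so startup continues.

```typescript
interface Migration {
  id: string;     // e.g. "001_add_productId_to_legacy_users"
  version: number;
  up: () => void; // must be idempotent
}

interface MigrationRecord {
  id: string;
  version: number;
  status: "applied" | "failed";
  error?: string;
}

// Run unapplied migrations in version order and return the new records.
// Previously failed migrations are retried because only "applied" is skipped.
function runPending(all: Migration[], records: MigrationRecord[]): MigrationRecord[] {
  const applied: { [id: string]: boolean } = {};
  for (const r of records) {
    if (r.status === "applied") applied[r.id] = true;
  }

  const result: MigrationRecord[] = [];
  const ordered = all.slice().sort((a, b) => a.version - b.version);
  for (const m of ordered) {
    if (applied[m.id]) continue;
    try {
      m.up();
      result.push({ id: m.id, version: m.version, status: "applied" });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      result.push({ id: m.id, version: m.version, status: "failed", error: message });
      break; // assumption: do not run later migrations after a failure
    }
  }
  return result;
}
```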
2.9 Data Export & Bulk Operations
Why: Admins regularly need: export users as CSV, export audit logs, bulk status updates, bulk license revocation. Today these require direct database queries.
Current state: Waitlist has a CSV export endpoint. Nothing else supports bulk operations.
Proposed design:
platform-service/src/modules/exports/
├── types.ts — ExportJob, ExportFormat schemas
├── repository.ts — Cosmos: export_jobs container
├── workers/
│ ├── users.ts — Export users as CSV/JSON
│ ├── audit.ts — Export audit log
│ ├── telemetry.ts — Export telemetry events
│ ├── usage.ts — Export usage data
│ └── subscriptions.ts — Export subscriptions
└── routes.ts — POST /exports (start), GET /exports (list), GET /exports/:id/download
Flow:
- Admin POST /api/exports → { type: 'users', format: 'csv', filters: { plan: 'free' } }
- Background job runs query, writes result to blob storage (via existing blob module)
- Job status updates: pending → processing → ready / failed
- Admin downloads from signed blob URL (SAS token via @bytelyst/blob)
Dependencies: blob module (existing) for storage, jobs module (§2.1) for auto-cleanup of expired exports.
Supported exports:
- Users (with filters: plan, status, date range)
- Audit log (with filters: action, userId, date range)
- Telemetry events (with filters: platform, eventType, date range)
- Usage records (with filters: userId, date range)
- Subscriptions (with filters: plan, status)
- Licenses (with filters: status, plan)
Admin UI:
- /ops/exports page: create new export, list past exports, download links
- Progress indicator for running exports
- Auto-cleanup: delete export blobs after 7 days
2.10 Maintenance Mode & Graceful Degradation
Why: Kill switch is binary (on/off per product). Need nuanced control: read-only mode, specific features disabled, custom banner messages, admin bypass, scheduled windows.
Current state: settings/kill-switch endpoint returns boolean per product. Clients check and fully disable themselves.
Proposed design:
Extend the existing settings module:
interface MaintenanceConfig {
mode: 'off' | 'read_only' | 'maintenance' | 'emergency';
message: string; // Shown to users
adminMessage?: string; // Shown to admins
bypassRoles: string[]; // Roles that can bypass (e.g., ['admin', 'super_admin'])
bypassIPs: string[]; // IP addresses that bypass
scheduledStart?: string; // ISO — for planned maintenance
scheduledEnd?: string;
affectedServices: string[]; // ['api', 'dictation', 'extraction'] or ['*']
updatedAt: string;
updatedBy: string;
}
Modes:
- off — Normal operation
- read_only — GET requests allowed, writes blocked (for database maintenance)
- maintenance — All requests return 503 with message (except admin bypass)
- emergency — Kill switch + maintenance message + all clients show error
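Server-side enforcement of these modes might reduce to one decision function. This is a sketch under several assumptions: the type and function names are hypothetical, bypass rules apply in every mode, blocked writes in read_only also return 503, HEAD is treated as a read alongside GET, and scheduled windows and affectedServices are ignored.

```typescript
type MaintenanceMode = "off" | "read_only" | "maintenance" | "emergency";

interface MaintenanceCheck {
  mode: MaintenanceMode;
  bypassRoles: string[];
  bypassIPs: string[]; // exact-match addresses in this sketch
}

interface IncomingRequest {
  method: string;
  role: string;
  ip: string;
}

// Returns the HTTP status to short-circuit with, or null to let the request through.
function maintenanceStatus(config: MaintenanceCheck, req: IncomingRequest): number | null {
  if (config.mode === "off") return null;
  const bypass =
    config.bypassRoles.indexOf(req.role) !== -1 || config.bypassIPs.indexOf(req.ip) !== -1;
  if (bypass) return null;
  if (config.mode === "read_only") {
    const isRead = req.method === "GET" || req.method === "HEAD";
    return isRead ? null : 503;
  }
  return 503; // maintenance and emergency block everything
}
```

A Fastify hook could call this once per request and reply with the configured message whenever a status is returned.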
Endpoints:
- GET /settings/maintenance — Public: check current mode + message
- PUT /settings/maintenance — Admin: update mode, message, bypass rules
- GET /settings/maintenance/schedule — Upcoming maintenance windows
Client integration:
- Clients poll /settings/maintenance alongside kill-switch check
- If mode !== 'off', show banner with message
- If mode === 'read_only', disable write operations with user-facing explanation
Admin UI:
- Extend existing Settings page or add /ops/maintenance
- Mode toggle (off/read-only/maintenance/emergency)
- Message editor with preview
- Schedule builder with start/end date pickers
- Bypass IP whitelist management
Storage: Maintenance config is a single document per product in the existing settings container (field: maintenanceConfig). No new Cosmos container needed.
2.11 Rate Limit Dashboard & IP Allow/Deny Lists
Why: ratelimit module exists but admins have zero visibility into who's being rate-limited, and no ability to whitelist VIP users or blacklist abusive IPs.
Current state: In-memory sliding window rate limiter with configurable rules. No persistence, no admin visibility.
Proposed design:
Extend ratelimit module:
interface RateLimitEntry {
key: string; // userId or IP
productId: string;
currentCount: number;
windowStart: string;
wasLimited: boolean;
lastLimitedAt?: string;
}
interface IPRule {
id: string;
productId: string;
ip: string; // CIDR notation supported
action: 'allow' | 'deny';
reason: string;
createdBy: string;
createdAt: string;
expiresAt?: string; // Temporary blocks
}
Additional endpoints:
- GET /ratelimit/stats — Admin: top rate-limited keys, total 429s in last hour/day
- GET /ratelimit/blocked — Admin: currently blocked keys
- POST /ratelimit/ip-rules — Admin: add IP allow/deny rule
- GET /ratelimit/ip-rules — Admin: list rules
- DELETE /ratelimit/ip-rules/:id — Admin: remove rule
Admin UI:
- /ops/rate-limits page: real-time rate limit stats
- Top offenders table (most 429 responses)
- IP rules management (allow/deny with expiry)
- Per-user rate limit override
Cosmos container:
- ip_rules (pk: /productId) — persistent IP allow/deny rules
- Rate limit stats remain in-memory (ephemeral); no persistence needed for counters
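Evaluating an IP rule with CIDR support could look like the sketch below. It covers IPv4 only, the names (IpRuleLite, matchesRule, decide) are hypothetical, and "first matching unexpired rule wins" is an assumed precedence that the design above does not specify.

```typescript
interface IpRuleLite {
  ip: string;         // IPv4 address or CIDR block, e.g. "10.0.0.0/8"
  action: "allow" | "deny";
  expiresAt?: number; // epoch ms; omitted means permanent
}

// Parse dotted-quad IPv4 into an unsigned 32-bit number, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const n = Number(part);
    if (n > 255) return null;
    value = value * 256 + n;
  }
  return value;
}

// True if `ip` falls inside `rule` (a single address or a CIDR block).
function matchesRule(ip: string, rule: string): boolean {
  const pieces = rule.split("/");
  if (pieces.length > 2) return false;
  let bits = 32;
  if (pieces.length === 2) {
    if (!/^\d{1,2}$/.test(pieces[1])) return false;
    bits = Number(pieces[1]);
  }
  const addr = ipv4ToInt(ip);
  const base = ipv4ToInt(pieces[0]);
  if (addr === null || base === null || bits > 32) return false;
  const size = Math.pow(2, 32 - bits); // number of addresses in the block
  return Math.floor(addr / size) === Math.floor(base / size);
}

// First matching, unexpired rule wins; null means no rule applies.
function decide(ip: string, rules: IpRuleLite[], now: number): "allow" | "deny" | null {
  for (const rule of rules) {
    if (rule.expiresAt !== undefined && rule.expiresAt <= now) continue;
    if (matchesRule(ip, rule.ip)) return rule.action;
  }
  return null;
}
```

Division is used instead of bit shifts because JavaScript shifts operate on signed 32-bit values and a shift by 32 is a no-op, which would break the /0 case.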
P2 — Product Intelligence
These components provide deeper insight into product health, user behavior, and experiment outcomes. They transform raw data into actionable intelligence.
2.12 A/B Testing & Experiments Framework
Why: Feature flags exist but only support on/off with percentage rollout. No variant assignment, metric collection, or statistical significance calculation.
Current state: flags module with boolean flags and FNV-1a deterministic rollout.
Proposed design:
Extend flags module or create sibling experiments module:
platform-service/src/modules/experiments/
├── types.ts — Experiment, Variant, ExperimentMetric schemas
├── repository.ts — Cosmos: experiments container
├── assignment.ts — Deterministic variant assignment (extend FNV-1a)
├── analysis.ts — Statistical significance calculation
└── routes.ts — Admin CRUD + results endpoint
Experiment document:
interface ExperimentDoc {
id: string;
productId: string;
name: string;
hypothesis: string;
status: 'draft' | 'running' | 'paused' | 'concluded';
variants: Variant[]; // [{id: 'control', weight: 50}, {id: 'treatment', weight: 50}]
targetingRules: FlagTargetingRules; // Reuse from flags module (platforms, versions, percentage)
primaryMetric: string; // e.g., 'dictation_completed_rate'
secondaryMetrics: string[];
startedAt?: string;
concludedAt?: string;
winningVariant?: string;
sampleSize: number;
results?: ExperimentResults;
}
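Deterministic variant assignment on top of FNV-1a could be sketched as follows. How the existing flags module builds its hash key is not shown here, so the "experimentId:userId" key format and the function names are assumptions. The hash runs over UTF-16 code units, which matches byte-wise FNV-1a for ASCII input.

```typescript
interface WeightedVariant {
  id: string;
  weight: number; // relative weight, e.g. 50
}

// 32-bit FNV-1a. Multiplying by the FNV prime (16777619 = 2^24 + 2^8 + 0x93)
// is written as shifts so the intermediate sum stays an exact integer before
// truncation to 32 bits.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return hash >>> 0;
}

// Map (experimentId, userId) to a variant in proportion to the weights.
// The same inputs always give the same variant.
function assignVariant(experimentId: string, userId: string, variants: WeightedVariant[]): string {
  let total = 0;
  for (const v of variants) total += v.weight;
  if (variants.length === 0 || total <= 0) {
    throw new Error("experiment needs at least one variant with positive weight");
  }
  const bucket = fnv1a(experimentId + ":" + userId) % total;
  let cumulative = 0;
  for (const v of variants) {
    cumulative += v.weight;
    if (bucket < cumulative) return v.id;
  }
  return variants[variants.length - 1].id; // not reached for integer weights
}
```

Including the experiment ID in the hash key keeps assignments independent across experiments, so a user who lands in treatment for one test is not systematically in treatment for all of them.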
Admin UI:
- /experiments page: list experiments, create new, view results
- Results view: conversion rates per variant, confidence interval, statistical significance indicator
- "Conclude" action: pick winner, auto-convert to feature flag
2.13 Analytics Aggregation Pipeline
Why: usage tracks raw events but there are no pre-aggregated rollups. Admin dashboard charts require expensive real-time queries. DAU/WAU/MAU, retention cohorts, and funnel analysis are impossible without rollups.
Current state: Raw usage_daily records. No aggregation.
Proposed design:
platform-service/src/modules/analytics/
├── types.ts — MetricRollup, CohortEntry, FunnelStep schemas
├── repository.ts — Cosmos: analytics_rollups container
├── rollup-jobs/
│ ├── dau-wau-mau.ts — Daily/weekly/monthly active users
│ ├── retention.ts — Cohort retention (D1, D7, D14, D30)
│ ├── funnel.ts — Conversion funnels (signup → activate → dictate → subscribe)
│ └── feature-adoption.ts — Per-feature usage rates
└── routes.ts — Admin: GET /analytics/dau, /retention, /funnel, /adoption
Rollup schedule (via jobs module):
- DAU: every hour (incremental)
- WAU/MAU: daily at 1am UTC
- Retention cohorts: daily at 2am UTC
- Funnels: daily at 2:30am UTC
Key metrics:
- DAU/WAU/MAU — with breakdown by platform, plan
- Retention cohorts — "Of users who signed up in week X, what % are active in week X+1, X+4?"
- Conversion funnel — signup → first dictation → 5th dictation → subscription
- Feature adoption — % of active users using each major feature
- Revenue metrics — MRR, churn rate, ARPU, LTV (from subscriptions + Stripe data)
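The core of the DAU and retention rollups is small enough to show in full. This is an in-memory sketch with hypothetical names: real rollup jobs would page through usage records in Cosmos, and retention here is day-based (Dn) rather than the weekly cohorts in the example question above.

```typescript
interface ActivityRecord {
  userId: string;
  day: number; // whole days since an arbitrary epoch
}

// Count distinct users active on `day`.
function dailyActiveUsers(records: ActivityRecord[], day: number): number {
  const seen: { [userId: string]: boolean } = {};
  let count = 0;
  for (const r of records) {
    if (r.day === day && !seen[r.userId]) {
      seen[r.userId] = true;
      count++;
    }
  }
  return count;
}

// Dn retention: of the users who signed up on `cohortDay`, the fraction that
// was active exactly `n` days later. Returns 0 for an empty cohort.
function retention(
  signups: { [userId: string]: number }, // userId -> signup day
  records: ActivityRecord[],
  cohortDay: number,
  n: number,
): number {
  const cohort: string[] = [];
  for (const userId in signups) {
    if (signups[userId] === cohortDay) cohort.push(userId);
  }
  if (cohort.length === 0) return 0;
  const active: { [userId: string]: boolean } = {};
  for (const r of records) {
    if (r.day === cohortDay + n) active[r.userId] = true;
  }
  let retained = 0;
  for (const userId of cohort) {
    if (active[userId]) retained++;
  }
  return retained / cohort.length;
}
```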
Admin UI:
- Extend dashboard home or create /analytics page
- Charts: DAU/WAU/MAU line chart, retention heatmap, funnel bar chart, MRR trend
2.14 In-App Feedback & Support Widget
Why: Tracker handles issue tracking but there's no way for end users to submit feedback directly from the app. Bug reports with device context, NPS surveys, and feature requests should flow into the tracker automatically.
Current state: Public roadmap allows feature submissions and voting. No in-app feedback widget.
Proposed design:
platform-service/src/modules/feedback/
├── types.ts — FeedbackEntry, FeedbackType, DeviceContext schemas
├── repository.ts — Cosmos: feedback container (pk: /productId)
└── routes.ts — POST /feedback (authenticated), GET /feedback (admin query)
Feedback types:
- bug_report — with device context, screenshot URL (blob), reproduction steps
- feature_request — auto-creates tracker item in items module
- nps_survey — score (0-10), comment, context
- general — free-form text
Client integration:
- Shake-to-report (iOS/Android) or keyboard shortcut (Desktop)
- Auto-attach: device model, OS version, app version, current screen, last 10 telemetry events
- Screenshot capture (optional, privacy-respecting)
Admin UI:
- /feedback page: list feedback with filters (type, platform, date range, NPS score range)
- Quick actions: convert to tracker item, reply, dismiss
- NPS dashboard: score distribution over time, detractor/promoter breakdown
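For the NPS dashboard, the score itself follows the standard definition: the percentage of promoters (scores 9 to 10) minus the percentage of detractors (scores 0 to 6). A minimal sketch, with hypothetical function names:

```typescript
interface NpsBreakdown {
  promoters: number;  // scores 9-10
  passives: number;   // scores 7-8
  detractors: number; // scores 0-6
}

// Bucket raw 0-10 survey scores.
function npsBreakdown(scores: number[]): NpsBreakdown {
  const result: NpsBreakdown = { promoters: 0, passives: 0, detractors: 0 };
  for (const s of scores) {
    if (s >= 9) result.promoters++;
    else if (s >= 7) result.passives++;
    else result.detractors++;
  }
  return result;
}

// NPS on a -100..100 scale, rounded to an integer; null when there are no responses.
function netPromoterScore(scores: number[]): number | null {
  if (scores.length === 0) return null;
  const b = npsBreakdown(scores);
  return Math.round(((b.promoters - b.detractors) / scores.length) * 100);
}
```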
2.15 User Impersonation / Admin Shadow Mode
Why: When a user reports a bug, admins need to see exactly what they see. Without impersonation, debugging requires asking users for screenshots and steps, which is slow and unreliable.
Current state: No impersonation capability.
Proposed design:
Endpoint:
- POST /auth/impersonate — Admin only. Accepts { targetUserId }. Returns a scoped shadow token.
Shadow token properties:
- Contains impersonatedBy: adminUserId claim
- Read-only by default (no writes unless explicitly allowed)
- Expires in 15 minutes (non-renewable)
- All actions logged to audit with impersonatedBy field
- Visible banner in dashboard: "You are viewing as [user name] — all actions are audited"
Admin UI:
- On the user detail page (`/users/:id`), add "View as User" button
- Opens user dashboard in new tab with shadow token
- Impersonation sessions listed on `/ops/audit` with filter
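The shadow token properties listed above can be sketched as a claims builder plus a request guard. This is a rough illustration only: token signing is omitted, and `buildShadowClaims`, `authorizeShadowRequest`, the `readOnly` claim, and the choice of GET/HEAD/OPTIONS as the "read" methods are assumptions rather than the existing auth module's API.

```typescript
interface ShadowClaims {
  sub: string;            // target user being viewed
  impersonatedBy: string; // admin user id, copied into every audit entry
  readOnly: boolean;
  iat: number;            // issued-at, seconds since epoch
  exp: number;            // expiry, seconds since epoch
}

const SHADOW_TTL_SECONDS = 15 * 60; // 15 minutes, non-renewable
const SAFE_METHODS = new Set(['GET', 'HEAD', 'OPTIONS']); // assumed "read" methods

// Hypothetical helper: builds the claims; signing is out of scope here.
function buildShadowClaims(
  adminUserId: string,
  targetUserId: string,
  nowMs: number,
  allowWrites = false,
): ShadowClaims {
  const iat = Math.floor(nowMs / 1000);
  return {
    sub: targetUserId,
    impersonatedBy: adminUserId,
    readOnly: !allowWrites,
    iat,
    exp: iat + SHADOW_TTL_SECONDS,
  };
}

type ShadowDecision =
  | { allowed: true }
  | { allowed: false; reason: 'expired' | 'read_only' };

// Hypothetical guard: rejects expired tokens and writes on read-only tokens.
function authorizeShadowRequest(
  claims: ShadowClaims,
  method: string,
  nowMs: number,
): ShadowDecision {
  if (Math.floor(nowMs / 1000) >= claims.exp) return { allowed: false, reason: 'expired' };
  if (claims.readOnly && !SAFE_METHODS.has(method.toUpperCase())) {
    return { allowed: false, reason: 'read_only' };
  }
  return { allowed: true };
}
```

Because the token is non-renewable, the guard never extends `exp`; an admin who needs more time would request a new token, which produces a second audit entry.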
2.16 Changelog & In-App Release Notes
Why: Users should know what changed in each release. A changelog system also serves as internal documentation and can be shown as a "What's New" modal in the app.
Current state: CHANGELOG.md exists in the repo but nothing in-app.
Proposed design:
platform-service/src/modules/changelog/
├── types.ts — ChangelogEntry, ReleaseNote schemas
├── repository.ts — Cosmos: changelog container (pk: /productId)
└── routes.ts — Public: GET /changelog (paginated)
— Admin: CRUD changelog entries
Entry document:
interface ChangelogEntry {
id: string;
productId: string;
version: string; // "1.2.0"
title: string;
body: string; // Markdown
category: 'feature' | 'improvement' | 'bugfix' | 'security';
platforms: string[]; // ['ios', 'android', 'desktop', 'web']
publishedAt?: string;
isDraft: boolean;
createdBy: string;
}
Client integration:
- App checks `GET /api/changelog?since=<lastSeenVersion>` on launch
- If new entries exist, show "What's New" modal
- User can dismiss; `lastSeenVersion` stored in settings
Admin UI:
- `/changelog` page: create/edit/publish entries with Markdown editor
- Preview mode before publishing
- Schedule publishing for future date
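The `since=<lastSeenVersion>` check needs a numeric version comparison, because a plain string comparison would sort "1.10.0" before "1.2.0". A minimal sketch, assuming versions are strict `major.minor.patch` strings as in the `ChangelogEntry` example; `compareVersions` and `entriesSince` are hypothetical helper names.

```typescript
interface VersionedEntry {
  version: string; // "1.2.0"
  title: string;
  isDraft: boolean;
}

// Assumes plain major.minor.patch; pre-release tags are not handled.
function parseVersion(v: string): [number, number, number] {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(v);
  if (!m) throw new Error(`Unsupported version format: ${v}`);
  return [Number(m[1]), Number(m[2]), Number(m[3])];
}

// Negative if a < b, zero if equal, positive if a > b.
function compareVersions(a: string, b: string): number {
  const [aMajor, aMinor, aPatch] = parseVersion(a);
  const [bMajor, bMinor, bPatch] = parseVersion(b);
  return aMajor - bMajor || aMinor - bMinor || aPatch - bPatch;
}

// Published entries newer than the version the user last saw, newest first.
function entriesSince<T extends VersionedEntry>(entries: T[], lastSeenVersion: string): T[] {
  return entries
    .filter((e) => !e.isDraft && compareVersions(e.version, lastSeenVersion) > 0)
    .sort((a, b) => compareVersions(b.version, a.version));
}
```

If the result is non-empty the client would show the "What's New" modal and then store the first entry's version as the new `lastSeenVersion`.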
P3 — Scale & Polish
These components are important for scale, security, and developer experience, but are lower urgency.
2.17 CDN & Asset Pipeline
Why: Blob storage serves files directly from Azure. No edge caching, no image optimization, no automatic resizing for avatars/thumbnails.
Proposed approach:
- Azure CDN or Cloudflare in front of blob storage
- Image resize on upload (Sharp) for avatars: 64px, 128px, 256px variants
- Cache headers: `Cache-Control: public, max-age=31536000, immutable` for content-addressed assets
- Release binaries served via CDN for faster desktop app updates
2.18 API Versioning Strategy
Why: As external consumers appear (webhook integrations, third-party tools), breaking API changes need to be managed. Today all endpoints are unversioned.
Proposed approach:
- URL prefix: `/v1/api/...`
- Deprecation header: `Sunset: <date>` + `Deprecation: true`
- Version lifecycle: `current` → `deprecated` (6 months notice) → `retired`
- OpenAPI spec generated per version
- Fastify plugin that routes to versioned handlers
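The version lifecycle and deprecation headers above could be resolved by a small framework-independent function that the Fastify plugin calls per request. This is a sketch under assumptions: `ApiVersion`, `decideVersion`, and the 404/410 responses for unknown and retired versions are illustrative choices, not decisions recorded in this document.

```typescript
type VersionStatus = 'current' | 'deprecated' | 'retired';

interface ApiVersion {
  name: string;       // "v1"
  status: VersionStatus;
  sunsetAt?: string;  // ISO date when a deprecated version will be retired
}

interface VersionDecision {
  status: number;                  // HTTP status to use if the request is rejected, 200 otherwise
  headers: Record<string, string>; // headers to add to the response
}

// Extracts "v1" from "/v1/api/...", or null if the path is unversioned.
function versionFromPath(path: string): string | null {
  const m = /^\/(v\d+)\/api(\/|$)/.exec(path);
  return m?.[1] ?? null;
}

// Hypothetical resolver: unknown version -> 404, retired -> 410 (assumed), else 200.
function decideVersion(path: string, versions: ApiVersion[]): VersionDecision {
  const name = versionFromPath(path);
  const version = versions.find((v) => v.name === name);
  if (!version) return { status: 404, headers: {} };
  if (version.status === 'retired') return { status: 410, headers: {} };
  const headers: Record<string, string> = {};
  if (version.status === 'deprecated') {
    headers['Deprecation'] = 'true';
    // The Sunset header carries an HTTP-date, which toUTCString() produces.
    if (version.sunsetAt) headers['Sunset'] = new Date(version.sunsetAt).toUTCString();
  }
  return { status: 200, headers };
}
```

Keeping the decision separate from Fastify makes the lifecycle table easy to unit test; the plugin itself would only copy `headers` onto the reply and short-circuit on non-200 statuses.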
2.19 OpenAPI / Auto-Generated API Docs
Why: Platform-service already passes swagger config to createServiceApp(), but Zod schemas aren't fully wired to route definitions. The admin /docs page is a markdown doc browser (not API docs). Auto-generated API docs from Zod schemas would be nearly free.
Current state: `@fastify/swagger` is configured with title/description but route schemas aren't connected via a Zod type provider (the npm package is `fastify-type-provider-zod`, not under the `@fastify` scope). Swagger UI may already be partially served but without route-level detail.
Proposed approach:
- Wire `fastify-type-provider-zod` to connect existing Zod schemas to Fastify route definitions
- Verify `@fastify/swagger-ui` is serving at `/documentation` on platform-service
- Add route-level `schema: { body, querystring, params, response }` using existing Zod schemas
- Export OpenAPI JSON at `/documentation/json`
- Admin dashboard links to platform-service Swagger UI
2.20 Localization / i18n Service
Why: Centralized string management for all platforms. When adding a new language, change one place, not four codebases.
Proposed approach:
- `translations` Cosmos container (pk: `/productId:locale`)
- Admin UI: string management with translation status per locale
- Client SDK: fetch translations on launch, cache locally
- Fallback chain: requested locale → base locale → English
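The fallback chain (requested locale → base locale → English) can be sketched as a lookup over an in-memory catalog. `Catalog`, `fallbackChain`, and `translate` are hypothetical names, and returning the key itself when no translation exists is an assumption made for the example.

```typescript
// locale -> (string key -> translated text)
type Catalog = Record<string, Record<string, string> | undefined>;

// "pt-BR" -> ["pt-BR", "pt", "en"]; duplicates are removed, so "en" -> ["en"].
function fallbackChain(locale: string, defaultLocale = 'en'): string[] {
  const dash = locale.indexOf('-');
  const base = dash === -1 ? locale : locale.slice(0, dash);
  const chain: string[] = [];
  for (const candidate of [locale, base, defaultLocale]) {
    if (chain.indexOf(candidate) === -1) chain.push(candidate);
  }
  return chain;
}

// Returns the first translation found along the chain.
function translate(catalog: Catalog, locale: string, key: string): string {
  for (const candidate of fallbackChain(locale)) {
    const value = catalog[candidate]?.[key];
    if (value !== undefined) return value;
  }
  return key; // assumption: surface the raw key so missing strings are visible
}
```

A client SDK would populate the catalog from the cached fetch on launch, so the lookup itself stays synchronous.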
2.21 Full-Text Search
Why: Admin needs to search users by partial name/email. Users need to search memories/items. Cosmos SQL CONTAINS() is slow and doesn't rank results.
Proposed approach:
- Phase 1: Cosmos DB full-text search (preview feature, no extra cost)
- Phase 2: Azure AI Search for richer capabilities (fuzzy matching, facets, suggestions)
- Admin UI: unified search bar across entities (users, items, audit logs)
2.22 Multi-Tenant Workspace / Org / Team Management
Why: productId scopes data per product, but within a product there's no team or organization concept. Enterprise customers need: org hierarchy, team-scoped permissions, shared brains/workspaces.
Proposed design (future):
users → belong to → organizations → have → teams → own → resources
This is a major architectural expansion. Defer until enterprise tier is validated.
2.23 Data Retention & Lifecycle Policies
Why: Telemetry has TTL. Other containers don't. Old audit logs, expired sessions, redeemed promos, and stale waitlist entries accumulate forever.
Proposed approach:
- Admin-configurable retention policies per container
- Scheduled job (from §2.1) runs cleanup
- Default policies: audit (365 days), telemetry (30 days), sessions (90 days), export files (7 days)
- Admin UI: `/ops/retention` page showing policies and next cleanup run
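The cleanup job's core decision is a cutoff date per container. A minimal sketch using the default policies above, with admin overrides taking precedence; `retentionCutoff`, `isExpired`, and the short container keys (`audit`, `exports`, and so on) are hypothetical and would need mapping to the real container names.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Defaults from the proposal; keys are illustrative shorthand, not container ids.
const DEFAULT_RETENTION_DAYS: Record<string, number | undefined> = {
  audit: 365,
  telemetry: 30,
  sessions: 90,
  exports: 7,
};

// Returns the oldest timestamp to keep, or null when no policy applies.
function retentionCutoff(
  container: string,
  now: Date,
  overrides: Record<string, number | undefined> = {},
): Date | null {
  const days = overrides[container] ?? DEFAULT_RETENTION_DAYS[container];
  if (days === undefined) return null; // no policy: keep forever
  return new Date(now.getTime() - days * DAY_MS);
}

// A document is eligible for cleanup when it was created before the cutoff.
function isExpired(createdAtIso: string, cutoff: Date | null): boolean {
  return cutoff !== null && new Date(createdAtIso).getTime() < cutoff.getTime();
}
```

For containers that already have a Cosmos TTL, the scheduled job would be redundant; the sketch is aimed at containers where the policy is admin-configurable and can change after documents are written.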
2.24 Automated Backup & Point-in-Time Restore
Why: Azure Cosmos DB has continuous backup, but admin needs visibility and one-click restore capability.
Proposed approach:
- Admin UI: `/ops/backups` page showing Azure backup status
- Manual export to blob (scheduled job from §2.1)
- Restore button: triggers Azure Cosmos point-in-time restore API
- Cross-region replication status indicator
2.25 Billing Dunning & Payment Recovery
Why: Stripe handles retries, but the platform needs to: notify users of failed payments, offer grace periods, and eventually downgrade plans.
Proposed flow:
- `invoice.payment_failed` → send "payment failed" email (§2.2) + in-app banner
- After 3 failures (Stripe Smart Retries) → send "final warning" email
- After grace period (7 days) → downgrade to free plan + email notification
- All transitions logged to audit
Integration: Stripe webhook handler (existing) + email delivery (§2.2) + scheduled job (§2.1) for grace period enforcement.
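The dunning flow can be modeled as a small state machine driven by two inputs: the `invoice.payment_failed` webhook and the scheduled grace-period check. This is a sketch only. The state names, action names, and the choice to start the 7-day grace period at the final warning are assumptions; the flow above does not say when the grace period begins.

```typescript
type DunningState = 'active' | 'payment_failed' | 'final_warning' | 'downgraded';
type DunningAction =
  | 'send_payment_failed_email'
  | 'show_in_app_banner'
  | 'send_final_warning_email'
  | 'downgrade_to_free'
  | 'send_downgrade_email'
  | 'write_audit_log';

interface DunningRecord {
  state: DunningState;
  failures: number;
  finalWarningAtMs: number | null;
}

interface Transition {
  record: DunningRecord;
  actions: DunningAction[];
}

const MAX_FAILURES = 3;
const GRACE_PERIOD_MS = 7 * 24 * 60 * 60 * 1000; // assumed to start at the final warning

// Called from the Stripe webhook handler on invoice.payment_failed.
function onPaymentFailed(rec: DunningRecord, nowMs: number): Transition {
  if (rec.state === 'downgraded' || rec.state === 'final_warning') {
    return { record: rec, actions: [] }; // already warned or downgraded
  }
  const failures = rec.failures + 1;
  if (failures >= MAX_FAILURES) {
    return {
      record: { state: 'final_warning', failures, finalWarningAtMs: nowMs },
      actions: ['send_final_warning_email', 'write_audit_log'],
    };
  }
  return {
    record: { state: 'payment_failed', failures, finalWarningAtMs: null },
    actions: ['send_payment_failed_email', 'show_in_app_banner', 'write_audit_log'],
  };
}

// Called by the scheduled job (§2.1) to enforce the grace period.
function onGraceCheck(rec: DunningRecord, nowMs: number): Transition {
  if (rec.state !== 'final_warning' || rec.finalWarningAtMs === null) {
    return { record: rec, actions: [] };
  }
  if (nowMs - rec.finalWarningAtMs < GRACE_PERIOD_MS) return { record: rec, actions: [] };
  return {
    record: { ...rec, state: 'downgraded' },
    actions: ['downgrade_to_free', 'send_downgrade_email', 'write_audit_log'],
  };
}
```

Both functions are pure, so the webhook handler and the job can persist `record` and dispatch `actions` through the event bus without the state logic knowing about either. A successful payment would reset the record to `active`; that path is omitted here.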
3. Implementation Priority Matrix
| Phase | Components | Effort | Dependencies | Unlocks |
|---|---|---|---|---|
| Sprint 1 | 2.1 Scheduled Jobs | M | None | Foundation for all time-based operations |
| Sprint 1 | 2.4 Event Bus | S | None | Decoupling for email, webhooks, audit |
| Sprint 2 | 2.2 Email Delivery | M | 2.4 Event Bus | User communication (welcome, trial expiry, payment failed) |
| Sprint 2 | 2.5 Password Reset + Email Verify | S | 2.2 Email Delivery | Auth completeness — table-stakes for production |
| Sprint 3 | 2.3 Webhook Subscriptions | M | 2.4 Event Bus | Third-party integrations, Zapier/Slack |
| Sprint 3 | 2.7 Session Management | S | None | Security (sign out everywhere, revocation) |
| Sprint 4 | 2.10 Maintenance Mode | S | None | Operational control during deployments |
| Sprint 4 | 2.9 Data Export | S | 2.1 Jobs (for blob cleanup) | Admin self-service, compliance |
| Sprint 5 | 2.13 Analytics Rollups | M | 2.1 Jobs (for rollup scheduling) | Dashboard charts, business metrics |
| Sprint 5 | 2.19 OpenAPI Docs | S | None | Developer experience, API discoverability |
| Sprint 6 | 2.6 Status Page | S | None | User trust, incident communication |
| Sprint 6 | 2.16 Changelog | S | None | User engagement, release communication |
| Sprint 7 | 2.11 Rate Limit Dashboard | S | None | Ops visibility |
| Sprint 7 | 2.25 Billing Dunning | S | 2.1 Jobs + 2.2 Email | Payment recovery automation |
| Later | 2.8, 2.12, 2.14–2.15, 2.17–2.18, 2.20–2.24 | Varies | — | Scale, polish, enterprise |
Effort key: S = Small (1–2 days), M = Medium (3–5 days), L = Large (1–2 weeks)
Critical path: Event Bus (2.4) → Email Delivery (2.2) → Password Reset (2.5). These three should be the first items built, in that order.
4. New Cosmos Containers & Cost Impact
Each new component introduces Cosmos containers. Cosmos DB Serverless charges per RU consumed + storage, so idle containers cost only storage (~$0.25/GB/month).
| Component | New Containers | Partition Key | Est. TTL | Est. Daily RU |
|---|---|---|---|---|
| 2.1 Jobs | `job_definitions`, `job_runs` | `/productId`, `/productId:jobName` | runs: 90d | ~50 RU (low volume) |
| 2.2 Email/Push | `delivery_log`, `email_templates` | `/productId:channel:yyyyMM`, `/productId` | log: 90d | ~200 RU |
| 2.3 Webhooks | `webhook_subscriptions`, `webhook_deliveries` | `/productId`, `/subscriptionId:yyyyMM` | deliveries: 30d | ~100 RU |
| 2.5 Password Reset | `password_reset_tokens`, `email_verifications` | `/productId`, `/productId` | 24h auto | ~10 RU |
| 2.6 Status | `service_status`, `incidents` | `/productId`, `/productId` | None | ~20 RU |
| 2.7 Sessions | `sessions` | `/userId` | 90d | ~500 RU (read-heavy) |
| 2.8 Migrations | `migrations` | `/productId` | None | ~5 RU (startup only) |
| 2.9 Exports | `export_jobs` | `/productId` | 30d | ~20 RU |
| 2.12 Experiments | `experiments` | `/productId` | None | ~50 RU |
| 2.13 Analytics | `analytics_rollups` | `/productId:metric:period` | None | ~300 RU (write-heavy during rollup) |
| 2.11 IP Rules | `ip_rules` | `/productId` | None (manual) | ~10 RU |
| 2.14 Feedback | `feedback` | `/productId` | None | ~50 RU |
| 2.16 Changelog | `changelog` | `/productId` | None | ~10 RU |
| 2.20 i18n | `translations` | `/productId:locale` | None | ~100 RU (read-heavy, cacheable) |
| 2.23 Retention | `retention_policies` | `/productId` | None | ~5 RU |
Total new containers: ~19 (across all phases)
Existing containers: 27 (defined in cosmos-init.ts: products, users, settings, devices, notification_prefs, audit_log, feature_flags, invitation_codes, referrals, subscriptions, payments, licenses, plans, usage_daily, api_tokens, tracker_items, comments, votes, themes, waitlist, memory_items, daily_briefs, reflections, brain_insights, telemetry_events, telemetry_error_clusters, telemetry_collection_policies). Note: promos module uses Stripe API directly — no Cosmos container.
Cost impact: Minimal for Serverless tier — idle containers only consume storage. Active containers during job runs add burst RU.
Recommendation: Register all new containers in cosmos-init.ts alongside existing ones. Use TTL liberally for transient data (tokens, deliveries, job runs) to keep storage bounded.
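As an illustration of the TTL recommendation, the transient containers could be declared with their TTL converted to seconds, since Cosmos DB expresses a container's default TTL in seconds. The `ContainerSpec` shape and helper names below are assumptions for the sketch; the actual structure of `cosmos-init.ts` is not shown in this document.

```typescript
// Hypothetical registration shape; cosmos-init.ts may use a different one.
interface ContainerSpec {
  id: string;
  partitionKeyPath: string;
  defaultTtlSeconds?: number; // omitted means documents never expire
}

const hours = (n: number): number => n * 60 * 60;
const days = (n: number): number => hours(24 * n);

// A few of the transient containers from the table above, with their estimated TTLs.
const TRANSIENT_CONTAINERS: ContainerSpec[] = [
  { id: 'password_reset_tokens', partitionKeyPath: '/productId', defaultTtlSeconds: hours(24) },
  { id: 'email_verifications', partitionKeyPath: '/productId', defaultTtlSeconds: hours(24) },
  { id: 'export_jobs', partitionKeyPath: '/productId', defaultTtlSeconds: days(30) },
  { id: 'sessions', partitionKeyPath: '/userId', defaultTtlSeconds: days(90) },
  { id: 'changelog', partitionKeyPath: '/productId' }, // long-lived, no TTL
];

// Sanity check to run at startup: ids must be unique and TTLs positive.
function validateSpecs(specs: ContainerSpec[]): string[] {
  const errors: string[] = [];
  const seen = new Set<string>();
  for (const spec of specs) {
    if (seen.has(spec.id)) errors.push(`duplicate container id: ${spec.id}`);
    seen.add(spec.id);
    if (spec.defaultTtlSeconds !== undefined && spec.defaultTtlSeconds <= 0) {
      errors.push(`non-positive TTL for ${spec.id}`);
    }
  }
  return errors;
}
```

Declaring TTLs next to the container definition keeps storage bounded without a cleanup job, which leaves §2.23 to handle only the containers whose retention is admin-configurable.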
5. New Environment Variables
New components will require additional env vars. All should be added to .env.example files in both repos and documented.
| Component | Variable | Example | Required |
|---|---|---|---|
| 2.1 Jobs | `JOB_RUNNER_ENABLED` | `true` | No (default: true) |
| 2.1 Jobs | `JOB_TICK_INTERVAL_MS` | `60000` | No (default: 60s) |
| 2.2 Email | `SENDGRID_API_KEY` | `SG.xxx` | Yes (for email delivery) |
| 2.2 Email | `EMAIL_FROM_ADDRESS` | `noreply@lysnrai.com` | Yes |
| 2.2 Email | `EMAIL_FROM_NAME` | `LysnrAI` | No |
| 2.2 Push | `APNS_KEY_ID` | `ABC123` | Yes (for iOS push) |
| 2.2 Push | `APNS_TEAM_ID` | `748N7QPX7J` | Yes |
| 2.2 Push | `APNS_KEY_PATH` | `./certs/AuthKey.p8` | Yes |
| 2.2 Push | `FCM_SERVICE_ACCOUNT_JSON` | `{...}` | Yes (for Android push) |
| 2.5 Auth | `PASSWORD_RESET_URL_BASE` | `https://app.lysnrai.com/reset` | Yes |
| 2.5 Auth | `EMAIL_VERIFY_URL_BASE` | `https://app.lysnrai.com/verify` | Yes |
| 2.10 Maintenance | `MAINTENANCE_MODE` | `off` | No (default: off) |
| 2.10 Maintenance | `MAINTENANCE_BYPASS_IPS` | `10.0.0.1,10.0.0.2` | No |
| 2.3 Webhooks | `WEBHOOK_DELIVERY_TIMEOUT_MS` | `5000` | No (default: 5s) |
| 2.3 Webhooks | `WEBHOOK_MAX_RETRIES` | `3` | No (default: 3) |
| 2.7 Sessions | `SESSION_TTL_DAYS` | `90` | No (default: 90) |
| 2.7 Sessions | `SESSION_CACHE_TTL_MS` | `30000` | No (default: 30s) |
| 2.19 OpenAPI | `SWAGGER_UI_ENABLED` | `true` | No (default: true in dev) |
Secret management: SENDGRID_API_KEY, APNS_*, and FCM_* should be added to Azure Key Vault as lysnr-sendgrid-api-key, lysnr-apns-key-id, etc. Update LYSNR_SECRETS in @bytelyst/config to include them.
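A loader for a few of these variables might look like the sketch below, which applies the documented defaults and fails fast on missing required values. `loadPlatformEnv` and the `PlatformEnv` shape are hypothetical; how `@bytelyst/config` actually parses env vars is not described here. The function takes the environment as an argument so it can be tested without touching `process.env`.

```typescript
type Env = Record<string, string | undefined>;

interface PlatformEnv {
  jobRunnerEnabled: boolean;
  jobTickIntervalMs: number;
  webhookDeliveryTimeoutMs: number;
  webhookMaxRetries: number;
  sessionTtlDays: number;
  sessionCacheTtlMs: number;
  emailFromAddress: string;
}

function readInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === '') return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return n;
}

function readBool(env: Env, name: string, fallback: boolean): boolean {
  const raw = env[name];
  return raw === undefined ? fallback : raw.trim().toLowerCase() === 'true';
}

function readRequired(env: Env, name: string): string {
  const raw = env[name];
  if (raw === undefined || raw.trim() === '') throw new Error(`${name} is required`);
  return raw;
}

// Hypothetical loader; defaults mirror the "Required" column of the table above.
function loadPlatformEnv(env: Env): PlatformEnv {
  return {
    jobRunnerEnabled: readBool(env, 'JOB_RUNNER_ENABLED', true),
    jobTickIntervalMs: readInt(env, 'JOB_TICK_INTERVAL_MS', 60000),
    webhookDeliveryTimeoutMs: readInt(env, 'WEBHOOK_DELIVERY_TIMEOUT_MS', 5000),
    webhookMaxRetries: readInt(env, 'WEBHOOK_MAX_RETRIES', 3),
    sessionTtlDays: readInt(env, 'SESSION_TTL_DAYS', 90),
    sessionCacheTtlMs: readInt(env, 'SESSION_CACHE_TTL_MS', 30000),
    emailFromAddress: readRequired(env, 'EMAIL_FROM_ADDRESS'),
  };
}
```

In the service this would be called once at startup with `process.env`, so a misconfigured deployment fails before it accepts traffic.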
6. Quick Reference — Where Things Live
| Component | Repo | Path |
|---|---|---|
| Platform-service modules | `learning_ai_common_plat` | `services/platform-service/src/modules/` |
| Shared packages | `learning_ai_common_plat` | `packages/` |
| Admin dashboard | `learning_voice_ai_agent` | `admin-dashboard-web/` |
| User dashboard | `learning_voice_ai_agent` | `user-dashboard-web/` |
| Tracker dashboard | `learning_voice_ai_agent` | `tracker-dashboard-web/` |
| Docker Compose | both repos | docker-compose.yml |
| Monitoring | `learning_ai_common_plat` | `services/monitoring/` |
| Design tokens | `learning_ai_common_plat` | `packages/design-tokens/` |
| MindLyst native app | `learning_multimodal_memory_agents` | `mindlyst-native/` (KMP + SwiftUI + Compose + Next.js) |
| MindLyst web | `learning_multimodal_memory_agents` | `mindlyst-native/web/` |
| Existing webhooks | `learning_ai_common_plat` | `services/platform-service/src/lib/webhooks.ts` |
| Cosmos container defs | `learning_ai_common_plat` | `services/platform-service/src/lib/cosmos-init.ts` |
| Telemetry design doc | `learning_ai_common_plat` | `docs/WINDSURF/CLIENT_TELEMETRY_DESIGN.md` |
| Telemetry roadmap | `learning_ai_common_plat` | `docs/WINDSURF/TELEMETRY_ROADMAP.md` |
| This document | `learning_ai_common_plat` | `docs/WINDSURF/PLATFORM_COMPONENTS_ROADMAP.md` |
Appendix A: Risks & Open Questions
| # | Topic | Risk / Question | Mitigation |
|---|---|---|---|
| 1 | Leader election for jobs | In-process tick loop with Cosmos lease — what happens during deploys? Two instances may briefly both hold leases. | Cosmos lease has a built-in TTL. Use 30s lease with 10s renewal. During deploy overlap, the old instance's lease expires before the new one acquires. Jobs must be idempotent. |
| 2 | Email deliverability | SendGrid requires domain verification (SPF/DKIM/DMARC). Without it, emails land in spam. | Set up lysnrai.com domain authentication in SendGrid before shipping §2.2. Budget 1–2 days for DNS propagation. |
| 3 | Session validation latency | Checking Cosmos on every request for session revocation adds ~5–10ms per request. | In-memory cache with 30s TTL (§2.7). Revocation is eventually consistent — acceptable trade-off for most apps. Document the 30s window. |
| 4 | Cosmos container proliferation | 27 existing + 19 new = 46 containers. Serverless tier has no per-container cost, but management complexity grows. | Group related containers by module. Document all containers in cosmos-init.ts. Consider container-per-module naming convention. |
| 5 | Event bus ordering guarantees | In-memory `EventEmitter` has no ordering guarantees across handlers. If audit must record before webhook fires, ordering matters. | Phase 1: Document that handlers run concurrently with no ordering. If ordering is needed, use handler priority weights or sequential mode. |
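| 6 | Push notification certificates | Certificate-based APNs auth requires yearly certificate renewal. If the certificate expires, all iOS push silently stops. The token-based `.p8` key configured in §5 does not expire, but it can be revoked. | If certificates are used, add apns-cert-expiry-check to scheduled jobs (§2.1). Alert admin 30 days before expiry. |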
| 7 | Webhook abuse | External subscribers could register slow endpoints that back up the delivery queue. | Per-subscription timeout (5s default), circuit breaker after 10 consecutive failures, auto-disable. |
| 8 | Migration rollback | Cosmos is schemaless — some migrations (e.g., partition key changes) are irreversible. | Mark migrations as reversible: true/false. Require manual approval for irreversible migrations. Always back up before running. |
| 9 | MindLyst parity | MindLyst web uses Cosmos directly (in-memory fallback). Shared components (email, sessions, webhooks) must work for MindLyst too, not just LysnrAI. | All new modules use productId for multi-product isolation. MindLyst can consume the same platform-service APIs. |
| 10 | Priority conflicts | Sprint plan assumes available engineering bandwidth. If telemetry or mobile work takes priority, these sprints slip. | Treat sprint assignments as relative ordering, not calendar commitments. Re-evaluate after each sprint. |
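The webhook-abuse mitigation in row 7 (circuit breaker after 10 consecutive failures, then auto-disable) reduces to a small amount of per-subscription state. A minimal sketch, assuming a timeout counts as a failure and that re-enabling is a manual admin action; `SubscriptionHealth` and `recordDelivery` are hypothetical names.

```typescript
interface SubscriptionHealth {
  consecutiveFailures: number;
  disabled: boolean;
}

const MAX_CONSECUTIVE_FAILURES = 10;

const healthy = (): SubscriptionHealth => ({ consecutiveFailures: 0, disabled: false });

// Pure update: call once per delivery attempt (a timeout counts as a failure).
function recordDelivery(health: SubscriptionHealth, succeeded: boolean): SubscriptionHealth {
  if (health.disabled) return health; // stays disabled until an admin re-enables it
  if (succeeded) return { consecutiveFailures: 0, disabled: false };
  const consecutiveFailures = health.consecutiveFailures + 1;
  return {
    consecutiveFailures,
    disabled: consecutiveFailures >= MAX_CONSECUTIVE_FAILURES,
  };
}

// The delivery worker would skip disabled subscriptions entirely.
function shouldDeliver(health: SubscriptionHealth): boolean {
  return !health.disabled;
}
```

Because a single success resets the counter, only a sustained outage trips the breaker; an endpoint that fails intermittently keeps receiving events and relies on the retry policy instead.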
Appendix B: Component Dependency Graph
┌─────────────────────┐
│ Event Bus (2.4) │
└─────────┬───────────┘
│ emits to subscribers
┌───────────┼───────────┼───────────┐
│ │ │ │
▼ ▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ Email/Push│ │ Webhook │ │ Audit Log │ │ Analytics │
│ (2.2) │ │ (2.3) │ │ (existing)│ │ (2.13) │
└─────┬─────┘ └───────────┘ └───────────┘ └───────────┘
│
│ sends
▼
┌───────────┐
│ Password │
│ Reset(2.5)│
└───────────┘
┌───────────────┐──▶┌─────────────────┐ ┌─────────────────┐
│ Scheduled │ │ Analytics │ │ Blob Storage │
│ Jobs (2.1) │ │ Rollups (2.13) │ │ (existing) │
└───────┬───────┘ └─────────────────┘ └────────┬────────┘
│ │
│ triggers on schedule ▲ writes exports
▼ │
┌───────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Trial Expiry │ │ Usage Reset │ │ Data Export │
│ (2.1 job) │ │ (2.1 job) │ │ (2.9) │
└───────────────┘ └─────────────────┘ └─────────────────┘
┌───────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Billing │──▶│ Email/Push │ │ Retention │
│ Dunning(2.25) │ │ Delivery (2.2) │ │ Cleanup (2.23) │
└───────────────┘ └─────────────────┘ └─────────────────┘
Appendix C: Review Findings
Systematic review performed 2026-02-17. All issues below have been fixed inline.
| # | Severity | Section | Finding | Fix |
|---|---|---|---|---|
| 1 | Bug | §1.3 | Test count stale: said "158+ tests"; actual count is 621 (verified via `grep -c 'it(' *.test.ts`). | Updated to 621. |
| 2 | Bug | §1.1 | Endpoint column inconsistent: some modules said "CRUD" (vague, could be 4–8 routes), others had exact counts. | Replaced all "CRUD" with actual route counts. |
| 3 | Bug | §2.5 | Said "console-logged URLs for dev/testing", which violates project rule: never `console.log` in production code. | Changed to `req.log.info`. |
| 4 | Bug | §2.12 | `ExperimentDoc.targetingRules: {}` is a meaningless empty object type. | Changed to `FlagTargetingRules` (reuse from flags module). |
| 5 | Bug | §2.3 | Webhook event `user.deleted` source said `auth.delete`; no such endpoint name. Actual route is `DELETE /auth/users/:id` (admin action). | Fixed source column. |
| 6 | Bug | §4 | `email_verifications` container (from §2.5) missing from Cosmos table. Only `password_reset_tokens` was listed. | Added `email_verifications` to §2.5 row. |
| 7 | Bug | §4 | Existing container count said "~25+"; actual is 27 (counted from `cosmos-init.ts`; promos uses Stripe API directly, no Cosmos container). | Updated to 27 with full container list. |
| 8 | Bug | §4 | Total new containers said "~17"; after adding `email_verifications` and `ip_rules`, count is 19. | Updated. |
| 9 | Gap | §2.2 | No clarity on email template storage strategy. `renderer.ts` mentioned but not whether templates are Cosmos-stored or file-based. | Clarified: `repository.ts` now references `delivery_log` + `email_templates` containers. |
| 10 | Gap | §2.4 | No migration strategy from existing `lib/webhooks.ts` to new event bus pattern. | Added "Migration from existing `lib/webhooks.ts`" subsection with 3-phase plan. |
| 11 | Gap | §2.10 | Maintenance mode proposed extending settings module but didn't clarify storage location. Missing from §4 Cosmos table. | Added: stored as single document per product in existing `settings` container (no new container needed). |
| 12 | Gap | §2.11 | IP rules need persistence but no container was mentioned. Missing from §4 table. | Added `ip_rules` container (pk: `/productId`) to both §2.11 and §4 table. |
| 13 | Gap | §2.9 | Data Export didn't mention blob module dependency (exports written to blob storage). | Added explicit dependency note on blob module and jobs module for cleanup. |
| 14 | Gap | §5 | Missing env vars for webhooks (timeout, retries) and sessions (TTL, cache TTL). | Added 4 new env vars: `WEBHOOK_DELIVERY_TIMEOUT_MS`, `WEBHOOK_MAX_RETRIES`, `SESSION_TTL_DAYS`, `SESSION_CACHE_TTL_MS`. |
| 15 | Gap | §6 | Quick Reference missing MindLyst repo (`learning_multimodal_memory_agents`). Doc scope says "ByteLyst platform" which includes MindLyst. | Added MindLyst native app and web entries. Also added `cosmos-init.ts` path. |
| 16 | Gap | Appendix | Dependency graph incomplete: missing Jobs → Data Export connection, missing Blob → Data Export dependency, downstream jobs not labeled with section numbers. | Rewrote graph with all connections and section labels. |
| 17 | Gap | Overall | No "Risks & Open Questions" section; design docs should call out unknowns. | Added Appendix A with 10 risk items and mitigations. |
| 18 | Gap | TOC | Table of Contents didn't include Appendix sections. | Added Appendix A, B, C to TOC. |
| 19 | Gap | §2.5 | Password reset cross-referenced "§2.6" for sessions but sessions was renumbered to §2.7 in previous edit pass. | Fixed to §2.7 (caught in prior pass). |
| 20 | Gap | §1.5 | Infrastructure table was missing Swagger/OpenAPI (partially wired) and Prometheus metrics (partially enabled). | Added in prior pass; verified still present. |
This document is a living brainstorm. Items will be promoted to dedicated design docs (like CLIENT_TELEMETRY_DESIGN.md) as they move into implementation.