Intelligent A/B Testing — Implementation Roadmap

Module: platform-service/src/modules/ab-testing/
Admin UI: /ops/experiments/
Target: AI-powered experiment management with auto-allocation, early stopping, and hypothesis generation
Estimated Effort: 2.5-3 weeks
Status: Complete (Phases 1-4); Phase 5 is future work


Executive Summary

This roadmap delivers an intelligent A/B testing platform that goes beyond traditional feature flags. Instead of manual percentage rollouts, the system uses Thompson sampling for automatic traffic allocation, Bayesian early stopping when a variant clearly wins or loses, and LLM-powered hypothesis generation from feature flag usage patterns.

Key Differentiators vs. Static Feature Flags

| Capability | Static Flags (Current) | Intelligent A/B Testing |
| --- | --- | --- |
| Traffic Allocation | Manual percentage | Multi-armed bandit optimization |
| Stopping Decision | Manual monitoring | Auto-stop at statistical significance |
| Winner Selection | Human judgment | Bayesian probability of superiority |
| Test Duration | Fixed (often wrong) | Dynamic based on effect size |
| Hypothesis | Human-written | AI-generated from usage patterns |
| Sample Size | Guesswork | Power analysis + sequential testing |

Phase 1: Core Experiment Engine (Week 1)

1.1 Data Model & Schemas

  • 1.1.1 Create modules/ab-testing/types.ts
    • ExperimentDoc — experiment definition and config
    • VariantDoc — variant metadata + metrics
    • AssignmentDoc — user → variant assignments
    • MetricDoc — event types being tracked
    • ExperimentResult — statistical analysis results
    • Zod schemas for all inputs
  • 1.1.2 Add Cosmos containers to cosmos-init.ts
    • experiments (pk: /productId, TTL: 2 years for completed)
    • experiment_variants (pk: /experimentId)
    • experiment_assignments (pk: /userId, query by experiment)
    • experiment_events (pk: /experimentId + /timestamp for time-series)
    • experiment_metrics (pk: /experimentId, computed aggregates)

1.2 Assignment & Bucketing

  • 1.2.1 Create deterministic bucketing
    • Consistent hashing (userId + experimentId → variant)
    • FNV-1a hash algorithm (same as feature flags)
    • Sticky assignments (user always sees same variant)
    • Override capability (force specific variant for QA)
  • 1.2.2 Assignment strategies
    • random — Simple randomization (control vs static)
    • thompson — Thompson sampling (multi-armed bandit)
    • epsilon_greedy — Epsilon-greedy exploration
    • ucb — Upper Confidence Bound algorithm
  • 1.2.3 Audience targeting
    • User property filters (platform, version, region, subscription tier)
    • Percentage rollout within target segment
    • Exclusion lists (beta users, internal accounts)
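
The deterministic bucketing in 1.2.1 can be sketched as follows. This is a minimal illustration, not the module's actual code: `VariantAllocation`, `assignVariant`, the `experimentId:userId` hash key, and the 10,000-bucket resolution are all assumptions made for the example. Only the FNV-1a constants are standard.

```typescript
// Traffic share for one variant, in percent (0-100). Hypothetical shape.
interface VariantAllocation {
  variantId: string;
  percent: number;
}

// 32-bit FNV-1a. Hashes UTF-16 code units, which matches the byte-wise
// algorithm for ASCII-only IDs (assumed here).
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, mod 2^32
  }
  return hash >>> 0; // reinterpret as unsigned
}

// Same (userId, experimentId) always lands in the same bucket, so the
// assignment is sticky as long as the allocations do not change.
function assignVariant(
  userId: string,
  experimentId: string,
  allocations: VariantAllocation[],
  overrides: Map<string, string> = new Map(), // QA: userId -> forced variantId
): string {
  const last = allocations[allocations.length - 1];
  if (last === undefined) throw new Error("experiment has no variants");

  const forced = overrides.get(userId);
  if (forced !== undefined) return forced;

  const bucket = fnv1a32(`${experimentId}:${userId}`) % 10000; // 0..9999
  let cumulative = 0;
  for (const allocation of allocations) {
    cumulative += allocation.percent * 100; // percent -> 1/100 percent buckets
    if (bucket < cumulative) return allocation.variantId;
  }
  // Fallback if the percentages sum to slightly under 100 due to rounding.
  return last.variantId;
}
```

Including the experiment ID in the hash input keeps a user's bucket independent across experiments. With bandit strategies the allocations shift over time, so stickiness would additionally need the stored assignment record to take precedence over re-hashing.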

1.3 Event Tracking Pipeline

  • 1.3.1 Metric definitions
    • conversion — Binary (did/didn't convert)
    • count — Integer events (sessions, messages)
    • duration — Time-based (session length, task time)
    • revenue — Monetary (purchase amount, LTV)
    • custom — Arbitrary numeric values
  • 1.3.2 Event ingestion
    • POST /ab-testing/events batch endpoint
    • Client SDK: track(event, value, metadata)
    • Automatic attribution (which variant caused this event)
    • Deduplication (eventId + userId uniqueness)
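
As an illustration of the deduplication rule in 1.3.2, the sketch below drops repeated (eventId, userId) pairs from a batch. The `TrackedEvent` shape and the in-memory `Set` are assumptions for the example. A real ingestion path would need a durable uniqueness check, which this roadmap does not specify.

```typescript
// Hypothetical minimal event shape for the batch endpoint.
interface TrackedEvent {
  eventId: string;
  userId: string;
  metricName: string;
  value: number;
}

// Returns only events whose (userId, eventId) pair has not been seen before.
// `seen` is mutated so it can be shared across batches.
function dedupeEvents(
  batch: TrackedEvent[],
  seen: Set<string> = new Set(),
): TrackedEvent[] {
  const accepted: TrackedEvent[] = [];
  for (const trackedEvent of batch) {
    // JSON-encode the pair so IDs containing separators cannot collide.
    const key = JSON.stringify([trackedEvent.userId, trackedEvent.eventId]);
    if (seen.has(key)) continue;
    seen.add(key);
    accepted.push(trackedEvent);
  }
  return accepted;
}
```

Passing the same `seen` set to successive calls makes client retries of a whole batch idempotent within the lifetime of that set.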

Phase 1 Exit Criteria:

  • Experiments created with multiple variants
  • Users consistently assigned to variants
  • Events tracked and attributed correctly
  • 20+ tests for assignment and ingestion

Phase 2: Statistical Analysis Engine (Week 1-2)

2.1 Bayesian Inference

  • 2.1.1 Create modules/ab-testing/statistics.ts
    • BetaDistribution for conversion rates
    • GammaDistribution for count/duration metrics
    • NormalDistribution for continuous metrics
    • Monte Carlo simulation (10,000 samples)
  • 2.1.2 Probability calculations
    • probabilityVariantBeatsControl(variant, control)
    • expectedLossIfChosen(variant)
    • probabilityBeatAllVariants(variant)
  • 2.1.3 Credible intervals
    • 95% credible interval for each variant's true metric
    • Visualization-ready (lower, mean, upper bounds)
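
A rough sketch of how 2.1.2 and 2.1.3 could work for conversion metrics: draw Monte Carlo samples from each variant's Beta posterior, then compare them. The function name `probabilityVariantBeatsControl` comes from the list above. Everything else is an assumption for illustration, including the seeded `mulberry32` generator, the Marsaglia-Tsang gamma sampler, and the `ConversionCounts` shape.

```typescript
type Rng = () => number;

// Seeded PRNG (mulberry32) so runs are reproducible; Math.random also works.
function mulberry32(seed: number): Rng {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Box-Muller transform; 1 - rng() keeps the log argument in (0, 1].
function sampleStandardNormal(rng: Rng): number {
  return Math.sqrt(-2 * Math.log(1 - rng())) * Math.cos(2 * Math.PI * rng());
}

// Marsaglia-Tsang sampler for Gamma(shape, 1). Requires shape >= 1, which
// always holds for Beta posteriors built on a uniform Beta(1, 1) prior.
function sampleGamma(shape: number, rng: Rng): number {
  if (shape < 1) throw new RangeError("shape must be >= 1");
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    const x = sampleStandardNormal(rng);
    const v = Math.pow(1 + c * x, 3);
    if (v <= 0) continue;
    if (Math.log(1 - rng()) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v;
  }
}

function sampleBeta(alpha: number, beta: number, rng: Rng): number {
  const x = sampleGamma(alpha, rng);
  return x / (x + sampleGamma(beta, rng));
}

interface ConversionCounts {
  participants: number;
  conversions: number;
}

// Draws from Beta(1 + conversions, 1 + non-conversions).
function posteriorSamples(counts: ConversionCounts, n: number, rng: Rng): number[] {
  const alpha = 1 + counts.conversions;
  const beta = 1 + counts.participants - counts.conversions;
  return Array.from({ length: n }, () => sampleBeta(alpha, beta, rng));
}

function probabilityVariantBeatsControl(
  variant: ConversionCounts,
  control: ConversionCounts,
  samples = 10000,
  rng: Rng = mulberry32(42),
): number {
  let wins = 0;
  for (let i = 0; i < samples; i++) {
    const v = sampleBeta(1 + variant.conversions, 1 + variant.participants - variant.conversions, rng);
    const c = sampleBeta(1 + control.conversions, 1 + control.participants - control.conversions, rng);
    if (v > c) wins++;
  }
  return wins / samples;
}

// Equal-tailed 95% credible interval from sorted posterior draws.
function credibleInterval(counts: ConversionCounts, samples = 10000, rng: Rng = mulberry32(7)) {
  const draws = posteriorSamples(counts, samples, rng).sort((a, b) => a - b);
  const at = (q: number): number =>
    draws[Math.min(draws.length - 1, Math.floor(q * draws.length))] ?? NaN;
  const mean = draws.reduce((sum, x) => sum + x, 0) / draws.length;
  return { lower: at(0.025), mean, upper: at(0.975) };
}
```

For example, 120 conversions out of 1,000 against a control with 100 out of 1,000 gives a probability of beating control of a little over 0.9, which is below the 95% stopping threshold used in section 2.2.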

2.2 Early Stopping Rules

  • 2.2.1 Stopping criteria
    • Winner found: Variant has > 95% probability of beating control
    • Loser clear: Control has > 95% probability of beating variant
    • Practical significance: Stop for futility when the effect is unlikely to reach the minimum detectable effect
    • Time bound: Max duration reached (safety limit)
  • 2.2.2 Auto-promotion
    • Auto-rollout winner to 100% when threshold hit
    • Notify admins via Slack/email
    • Create audit log entry
  • 2.2.3 Guardrails
    • Minimum sample size before early stopping (100 users/variant)
    • Business hours only for auto-actions
    • Require approval for revenue-impacting experiments

2.3 Thompson Sampling

  • 2.3.1 Multi-armed bandit implementation
    • Sample from posterior distributions
    • Assign user to variant with highest sample
    • Re-balance traffic every hour based on performance
  • 2.3.2 Exploration vs exploitation
    • Exploration rate decays over time
    • High uncertainty = more exploration
    • Clear winner = more traffic to winner
  • 2.3.3 Regret minimization
    • Track cumulative regret vs optimal variant
    • Regret bounds reporting
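
One way to approximate the cumulative regret in 2.3.3 is to compare every variant's observed rate with the best observed rate, weighted by the traffic it received. True rates are unknown during an experiment, so this is an estimate. `ArmOutcome` and `estimateCumulativeRegret` are hypothetical names for this sketch.

```typescript
// Hypothetical per-variant totals for a conversion metric.
interface ArmOutcome {
  variantId: string;
  participants: number;
  conversions: number;
}

function observedRate(arm: ArmOutcome): number {
  return arm.participants === 0 ? 0 : arm.conversions / arm.participants;
}

// Estimated regret = sum over variants of
//   participants_i * (best observed rate - observed rate_i),
// i.e. conversions forgone by not sending everyone to the best variant.
function estimateCumulativeRegret(arms: ArmOutcome[]): { bestVariantId: string; regret: number } {
  let best: ArmOutcome | undefined;
  for (const arm of arms) {
    if (best === undefined || observedRate(arm) > observedRate(best)) best = arm;
  }
  if (best === undefined) throw new Error("no variants to compare");

  const bestRate = observedRate(best);
  const regret = arms.reduce(
    (sum, arm) => sum + arm.participants * (bestRate - observedRate(arm)),
    0,
  );
  return { bestVariantId: best.variantId, regret };
}
```

With 1,000 users per arm at 10% and 12% conversion, the estimate is about 20 forgone conversions, all attributable to traffic sent to the weaker arm.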

Phase 2 Exit Criteria:

  • Bayesian probabilities calculated correctly
  • Early stopping triggers at appropriate thresholds
  • Thompson sampling re-allocates traffic dynamically
  • Statistical tests validate correctness

Phase 3: AI-Powered Hypothesis Generation (Week 2)

3.1 Pattern Detection

  • 3.1.1 Usage pattern analysis
    • Analyze feature flag usage telemetry
    • Segment analysis (iOS vs Android, free vs pro)
    • Temporal patterns (day of week, time of day)
    • User behavior sequences (funnel analysis)
  • 3.1.2 Anomaly detection
    • Unexpected drop in feature adoption
    • Performance regression signals
    • User segment showing different behavior
  • 3.1.3 Opportunity identification
    • Underperforming features (low adoption)
    • High-dropoff flows
    • Competitor feature gaps

3.2 Hypothesis Generation

  • 3.2.1 LLM hypothesis prompts

    Given this feature usage data:
    - Feature: {featureName}
    - Current adoption: {adoptionRate}% (baseline: {baseline}%)
    - Segment performance: {segmentData}
    - User feedback: {feedbackSamples}
    - Competitor analysis: {competitorFeatures}
    
    Generate experiment hypotheses:
    1. Primary hypothesis: "Changing X will improve Y because..."
    2. Secondary hypotheses (2-3 alternatives)
    3. Expected effect size (conservative estimate)
    4. Success metric recommendation
    5. Risk assessment
    
  • 3.2.2 Hypothesis ranking

    • Expected impact scoring
    • Implementation difficulty estimate
    • Statistical power prediction
    • Risk-adjusted expected value
  • 3.2.3 Suggested experiment design

    • Variant count recommendation
    • Traffic allocation suggestion
    • Duration estimate
    • Required sample size calculation
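
For the required sample size and duration estimates in 3.2.3, a common starting point is the classical two-proportion power calculation. The sketch below assumes a two-sided test at α = 0.05 and 80% power, with z-scores hardcoded as defaults. It treats the minimum detectable effect as a relative percent change, as in `ExperimentDoc.primaryMetric`. The function names are illustrative only.

```typescript
interface SampleSizeInput {
  baselineRate: number;            // control conversion rate, 0-1
  minimumDetectableEffect: number; // relative % change to detect, e.g. 20
  zAlpha?: number;                 // z for two-sided alpha; 1.96 ~ alpha 0.05
  zBeta?: number;                  // z for power; 0.8416 ~ 80% power
}

// n per variant = (zAlpha + zBeta)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
function requiredSampleSizePerVariant({
  baselineRate,
  minimumDetectableEffect,
  zAlpha = 1.96,
  zBeta = 0.8416,
}: SampleSizeInput): number {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + minimumDetectableEffect / 100);
  const delta = p2 - p1;
  if (delta === 0) throw new RangeError("minimum detectable effect must be non-zero");
  const varianceSum = p1 * (1 - p1) + p2 * (1 - p2);
  const z = zAlpha + zBeta;
  return Math.ceil((z * z * varianceSum) / (delta * delta));
}

// Days needed if eligible traffic is split evenly across variants.
function estimateDurationDays(
  samplePerVariant: number,
  variantCount: number,
  dailyEligibleUsers: number,
): number {
  return Math.ceil((samplePerVariant * variantCount) / dailyEligibleUsers);
}
```

Detecting a 20% relative lift on a 10% baseline needs roughly 3,800 users per variant under these assumptions, far above the 100-user guardrail minimum. With Bayesian early stopping the experiment may end sooner, so this is better read as a planning upper bound than as a fixed duration.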

3.3 Auto-Experiment Suggestions

  • 3.3.1 Weekly AI reports
    • Top 5 experiment opportunities
    • Hypotheses with supporting evidence
    • Prioritized by expected impact
  • 3.3.2 One-click experiment creation
    • Pre-fill experiment from hypothesis
    • Suggested variants with descriptions
    • Pre-configured metrics

Phase 3 Exit Criteria:

  • AI generates meaningful hypotheses from usage data
  • Hypothesis quality rated by product team (80%+ useful)
  • Auto-suggested experiments created in 1 click
  • Weekly reports generated automatically

Phase 4: Admin Dashboard UI (Week 2-3)

4.1 Experiments List Page

  • 4.1.1 Create /ops/experiments/page.tsx
    • Experiment cards (status, duration, sample size)
    • Quick filters (running, completed, draft)
    • AI-generated hypothesis badge
    • Health indicators (traffic balance, event flow)
  • 4.1.2 Experiment creation wizard
    • Step 1: Define hypothesis (AI suggestions available)
    • Step 2: Create variants (name, description, config)
    • Step 3: Select metrics (primary + secondary)
    • Step 4: Audience targeting
    • Step 5: Traffic allocation (manual or Thompson)
    • Step 6: Review and launch

4.2 Live Experiment Dashboard

  • 4.2.1 Create /ops/experiments/[id]/page.tsx
    • Real-time metrics comparison
    • Variant performance table (conversions, counts, durations)
    • Bayesian probability visualization
    • Credible interval charts
  • 4.2.2 Statistical summary card
    • Probability of beating control (per variant)
    • Expected lift if implemented
    • Sample size progress bar
    • Days to significance estimate
  • 4.2.3 Action buttons
    • Adjust traffic allocation
    • Pause/resume experiment
    • Stop and declare winner
    • Rollout winner to 100%
    • Archive experiment

4.3 Results & Reporting

  • 4.3.1 Results page
    • Final statistical summary
    • Variant comparison visualization
    • Segment breakdown (iOS vs Android, etc.)
    • Credible intervals over time
  • 4.3.2 AI insights panel
    • Why this result occurred (LLM summary)
    • Unexpected findings
    • Follow-up experiment suggestions
  • 4.3.3 Export capabilities
    • CSV export of raw data
    • PDF report generation
    • API endpoint for data warehouse sync

Phase 4 Exit Criteria:

  • Full experiment lifecycle manageable in UI
  • Real-time stats visible and accurate
  • Bayesian visualizations clear to non-statisticians
  • Export and reporting functional

Phase 5: Advanced Capabilities (Future)

5.1 Multi-Variate Testing

  • Test multiple variables simultaneously
  • Full factorial and fractional factorial designs
  • Interaction effect detection

5.2 Sequential Experimentation

  • Multi-phase experiments (qualification → main → validation)
  • Holdout groups for long-term validation
  • Global holdout (never-exposed users)

5.3 Personalization Layer

  • Contextual bandits (different variants for different users)
  • ML model for variant selection
  • Automatic personalization optimization

5.4 Experiment Coordination

  • Mutually exclusive experiments
  • Experiment priority rules
  • Layered experimentation (orthogonal tests)

Appendix A: Data Models

ExperimentDoc

interface ExperimentDoc {
  id: string; // exp_<uuid>
  productId: string; // partition key

  // Experiment definition
  name: string;
  description: string;
  hypothesis: string;
  aiGeneratedHypothesis?: boolean; // Flag for AI-suggested

  // Status lifecycle: draft → running → paused | stopped | completed
  status: 'draft' | 'running' | 'paused' | 'stopped' | 'completed';

  // Variants
  controlVariantId: string; // Baseline variant
  variantIds: string[]; // All variant IDs

  // Configuration
  allocationStrategy: 'random' | 'thompson' | 'epsilon_greedy' | 'ucb';
  targetPercent: number; // % of eligible traffic

  // Audience targeting
  targeting: {
    platforms?: string[]; // ios, android, web
    appVersions?: { min: string; max?: string };
    regions?: string[];
    userSegments?: string[]; // pro, free, enterprise
    userProperties?: Record<string, string | number | boolean>;
  };

  // Metrics
  primaryMetric: {
    name: string;
    type: 'conversion' | 'count' | 'duration' | 'revenue' | 'custom';
    eventName: string; // Telemetry event to track
    aggregation: 'sum' | 'mean' | 'count' | 'unique';
    direction: 'increase' | 'decrease'; // Is higher better?
    minimumDetectableEffect: number; // % change we want to detect
  };
  secondaryMetrics: Array<{
    name: string;
    type: 'conversion' | 'count' | 'duration' | 'revenue' | 'custom';
    eventName: string;
  }>;

  // Guardrails
  guardrails: {
    minSampleSizePerVariant: number; // Default: 100
    maxDurationDays: number; // Safety limit, default: 30
    autoStopEnabled: boolean;
    winnerThreshold: number; // % probability to auto-stop, default: 95
    requireApprovalFor: 'none' | 'revenue' | 'all';
  };

  // Scheduling
  startAt?: string; // Scheduled start (ISO 8601)
  endAt?: string; // Scheduled end or actual stop

  // Stats (denormalized for fast reads)
  totalParticipants: number;
  totalEvents: number;

  // Timestamps
  createdAt: string;
  updatedAt: string;
  startedAt?: string;
  completedAt?: string;
  ttl: number; // 2 years for completed
}

VariantDoc

interface VariantDoc {
  id: string; // var_<uuid>
  experimentId: string; // partition key

  // Variant definition
  name: string; // "Control", "New Button Color", etc.
  description?: string;
  isControl: boolean;

  // Feature flag configuration
  flagConfig: Record<string, unknown>; // Arbitrary config payload

  // Traffic allocation (dynamic for bandit strategies)
  currentAllocationPercent: number; // 0-100%

  // Statistics (real-time computed)
  stats: {
    participants: number;
    events: number;

    // Primary metric
    primaryMetricValue: number; // Mean or conversion rate
    primaryMetricStdDev?: number;

    // For conversion metrics
    conversions?: number;
    conversionRate?: number; // 0-1

    // Bayesian posterior parameters
    betaAlpha?: number; // For Beta distribution
    betaBeta?: number;

    gammaShape?: number; // For Gamma distribution
    gammaScale?: number;
  };

  // Bayesian results
  bayesianResults?: {
    probabilityBeatsControl: number; // 0-1
    probabilityBeatsAll: number; // 0-1
    expectedLiftPercent: number; // Relative to control
    expectedLoss: number; // Risk of choosing this variant
    credibleInterval: {
      lower: number;
      mean: number;
      upper: number;
    };
  };

  createdAt: string;
  updatedAt: string;
}

ExperimentAssignmentDoc

interface ExperimentAssignmentDoc {
  id: string; // ea_<uuid>
  userId: string; // partition key (for user lookups)

  experimentId: string;
  variantId: string;

  // Assignment metadata
  assignedAt: string; // First assignment
  firstExposedAt?: string; // First actual exposure (feature use)

  // Context at assignment
  assignmentContext: {
    platform: string;
    appVersion: string;
    osVersion: string;
    deviceModel?: string;
    region?: string;
  };

  // Events attributed to this assignment
  eventCount: number;
  lastEventAt?: string;

  // TTL: Remove after experiment completes + analysis period
  ttl: number; // experimentEnd + 90 days
}

ExperimentEventDoc

interface ExperimentEventDoc {
  id: string; // ee_<uuid>
  experimentId: string; // partition key
  timestamp: string; // Sort key for time-series queries

  // Attribution
  userId: string;
  variantId: string;
  assignmentId: string;

  // Event details
  metricName: string;
  metricType: 'conversion' | 'count' | 'duration' | 'revenue' | 'custom';
  value: number; // Numeric value

  // Conversion tracking (for binary metrics)
  converted: boolean; // For conversion metrics

  // Context
  eventMetadata?: Record<string, unknown>;

  // Denormalized for filtering
  platform: string;
  appVersion: string;

  // TTL: Shorter for raw events
  ttl: number; // 90 days
}

Implementation Tracking

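| Phase | Task | Status | Commit |
| --- | --- | --- | --- |
| 1.1 | Experiment types & schemas | Complete | a9b2247 |
| 1.1 | Cosmos containers | Complete | a9b2247 |
| 1.2 | Deterministic bucketing | Complete | 783067e |
| 1.2 | Assignment strategies | Complete | 783067e |
| 1.2 | Audience targeting | Complete | 783067e |
| 1.3 | Metric definitions | Complete | 783067e |
| 1.3 | Event ingestion | Complete | 783067e |
| 2.1 | Bayesian inference engine | Complete | 783067e |
| 2.1 | Probability calculations | Complete | 783067e |
| 2.1 | Credible intervals | Complete | 783067e |
| 2.2 | Early stopping rules | Complete | 783067e |
| 2.2 | Auto-promotion | Complete | 783067e |
| 2.2 | Guardrails | Complete | 783067e |
| 2.3 | Thompson sampling | Complete | 783067e |
| 2.3 | Exploration vs exploitation | Complete | 783067e |
| 2.3 | Regret minimization | Complete | 783067e |
| 3.1 | Pattern detection | Complete | 44fa045 |
| 3.1 | Anomaly detection | Complete | 44fa045 |
| 3.2 | Hypothesis generation prompts | Complete | 44fa045 |
| 3.2 | Hypothesis ranking | Complete | 44fa045 |
| 3.3 | Auto-experiment suggestions | Complete | 44fa045 |
| 4.1 | Experiments list page | Complete | 44fa045 |
| 4.1 | Creation wizard | Complete | 44fa045 |
| 4.2 | Live dashboard | Complete | 44fa045 |
| 4.2 | Statistical summary | Complete | 44fa045 |
| 4.3 | Results & reporting | Complete | 44fa045 |
| 4.3 | AI insights panel | Complete | 44fa045 |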

Legend: Not started | 🟡 In progress | Complete | ⏸️ Deferred


Quick Reference for Implementing Agent

📋 Full Roadmap: /Users/sd9235/code/mygh/learning_ai_common_plat/docs/roadmaps/INTELLIGENT_AB_TESTING_ROADMAP.md

Key Files to Modify/Create:

services/platform-service/
├── src/
│   ├── modules/ab-testing/
│   │   ├── types.ts              # [1.1] Experiment, Variant, Assignment types
│   │   ├── repository.ts         # [1.2] Data access layer
│   │   ├── bucketing.ts          # [1.2] FNV-1a hash, sticky assignments
│   │   ├── statistics.ts         # [2.1] Bayesian inference, Beta/Normal distributions
│   │   ├── allocation.ts         # [2.3] Thompson sampling, bandit strategies
│   │   ├── hypothesis-generator.ts # [3.2] LLM pattern analysis
│   │   ├── routes.ts             # [4] REST API
│   │   └── ab-testing.test.ts    # Tests
│   ├── lib/
│   │   └── cosmos-init.ts        # [1.1] Add containers
│   └── server.ts                 # Register routes
dashboards/admin-web/
├── src/
│   ├── app/(dashboard)/
│   │   ├── experiments/
│   │   │   ├── page.tsx          # [4.1] Experiments list
│   │   │   ├── new/page.tsx      # [4.1] Creation wizard
│   │   │   └── [id]/
│   │   │       └── page.tsx      # [4.2] Live dashboard
│   ├── lib/
│   │   └── experiments-client.ts # API client
│   └── components/
│       └── experiments/          # Bayesian charts, variant cards

Commit Message Format:

feat(ab-testing): <description> [<task.code>]

Example:

git add services/platform-service/src/modules/ab-testing/
git commit -m "feat(ab-testing): add experiment types and cosmos containers [1.1]"

Testing Requirements:

  • Unit tests: 25+ Vitest tests for bucketing, statistics, bandit algorithms
  • Statistical validation: A/A tests, known distribution tests
  • Integration: End-to-end experiment lifecycle

Dependencies:

  • Feature flags module (reuse bucketing logic)
  • Telemetry module (event tracking)
  • Azure OpenAI (hypothesis generation)

Appendix B: Statistical Methods

Bayesian A/B Testing

Conversion Metrics (Beta-Binomial):

Posterior: Beta(α + conversions, β + non-conversions)
Where α = β = 1 (uniform prior)

Probability variant beats control:
P(variant > control) = ∫(x=0 to 1) BetaCDF_control(x) * BetaPDF_variant(x) dx
  ≈ (1/n) * Σ(i=1 to n) BetaCDF_control(x_i) * BetaPDF_variant(x_i), with x_i = (i - 0.5)/n

Continuous Metrics (Normal):

Posterior: Normal(μ_n, σ_n²)
Where μ_n, σ_n updated via conjugate prior

Probability variant beats control via Monte Carlo sampling
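
The conjugate update for continuous metrics can be illustrated with the Normal-Normal model, under the simplifying assumption that the observation variance is known. This sketch is not taken from `statistics.ts`. `NormalPosterior`, `updateNormalPosterior`, and `probabilityBeatsControlNormal` are hypothetical names.

```typescript
interface NormalPosterior {
  mean: number;     // μ
  variance: number; // σ²
}

// Normal prior + Normal likelihood with known observation variance:
//   precision_n = 1/σ0² + n/σ²
//   μ_n = (μ0/σ0² + n * sampleMean/σ²) / precision_n
//   σ_n² = 1 / precision_n
function updateNormalPosterior(
  prior: NormalPosterior,
  sampleMean: number,
  sampleSize: number,
  observationVariance: number,
): NormalPosterior {
  const precision = 1 / prior.variance + sampleSize / observationVariance;
  const mean =
    (prior.mean / prior.variance + (sampleSize * sampleMean) / observationVariance) / precision;
  return { mean, variance: 1 / precision };
}

// One draw from the posterior via the Box-Muller transform.
function sampleNormal(p: NormalPosterior, rng: () => number): number {
  const z = Math.sqrt(-2 * Math.log(1 - rng())) * Math.cos(2 * Math.PI * rng());
  return p.mean + Math.sqrt(p.variance) * z;
}

// Monte Carlo estimate of P(variant > control) for independent posteriors.
function probabilityBeatsControlNormal(
  variant: NormalPosterior,
  control: NormalPosterior,
  samples = 10000,
  rng: () => number = Math.random,
): number {
  let wins = 0;
  for (let i = 0; i < samples; i++) {
    if (sampleNormal(variant, rng) > sampleNormal(control, rng)) wins++;
  }
  return wins / samples;
}
```

For two independent Normal posteriors the same probability also has a closed form, Φ((μ_v - μ_c) / sqrt(σ_v² + σ_c²)), which is a convenient cross-check for the sampled estimate.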

Thompson Sampling

For each incoming user:
  For each variant:
    Sample θ_i from variant's posterior distribution
  Assign user to variant with max(θ_i)

Update variant's posterior after observing outcome
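
The loop above can be sketched in TypeScript as follows. To stay short, this example draws from a Normal approximation of each Beta posterior (matching its mean and variance) rather than from the exact Beta distribution. That substitution is an assumption made for brevity and is least accurate while counts are small. `Arm`, `chooseArm`, and `recordOutcome` are hypothetical names.

```typescript
type Rng = () => number;

// Per-variant conversion counts; the posterior is Beta(1 + successes, 1 + failures).
interface Arm {
  variantId: string;
  successes: number;
  failures: number;
}

function sampleStandardNormal(rng: Rng): number {
  return Math.sqrt(-2 * Math.log(1 - rng())) * Math.cos(2 * Math.PI * rng());
}

// Approximation: Normal with the same mean and variance as the Beta posterior.
function samplePosteriorApprox(arm: Arm, rng: Rng): number {
  const a = 1 + arm.successes;
  const b = 1 + arm.failures;
  const mean = a / (a + b);
  const variance = (a * b) / ((a + b) * (a + b) * (a + b + 1));
  return mean + Math.sqrt(variance) * sampleStandardNormal(rng);
}

// Sample θ_i for every arm and pick the arm with the largest sample.
function chooseArm(arms: Arm[], rng: Rng = Math.random): Arm {
  let best: Arm | undefined;
  let bestSample = -Infinity;
  for (const arm of arms) {
    const sample = samplePosteriorApprox(arm, rng);
    if (sample > bestSample) {
      best = arm;
      bestSample = sample;
    }
  }
  if (best === undefined) throw new Error("no arms to choose from");
  return best;
}

// Posterior update after observing the user's outcome.
function recordOutcome(arm: Arm, converted: boolean): void {
  if (converted) arm.successes++;
  else arm.failures++;
}
```

Uncertain arms have wide posteriors and therefore still win some draws, which is where the exploration comes from. As evidence accumulates the posteriors narrow and traffic concentrates on the better arm. If traffic is re-balanced hourly rather than per user, the same sampling step can be repeated many times to estimate each arm's share.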

Early Stopping

Stop experiment when:
  samples_per_variant > min_sample_size AND (
    max_variant P(beats control) > 0.95  → Winner found
    OR max_variant P(beats control) < 0.05 → No winner
  )
  OR days_running > max_duration → Time bound (safety limit)
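
The stopping rule can be read in more than one way depending on operator precedence. The sketch below takes one reading: the probability thresholds only apply once the minimum sample size is met, while the maximum duration acts as an unconditional safety limit. The guardrail field names follow `ExperimentDoc.guardrails`. `ExperimentSnapshot` and `evaluateStopping` are hypothetical.

```typescript
type StopDecision =
  | { stop: false }
  | { stop: true; reason: "winner" | "no_winner" | "time_bound" };

interface StoppingGuardrails {
  minSampleSizePerVariant: number; // e.g. 100
  maxDurationDays: number;         // e.g. 30
  winnerThreshold: number;         // percent, e.g. 95
}

// Hypothetical snapshot of a running experiment.
interface ExperimentSnapshot {
  probabilitiesBeatControl: number[]; // one value (0-1) per non-control variant
  smallestVariantSample: number;      // participants in the least-populated variant
  daysRunning: number;
}

function evaluateStopping(
  snapshot: ExperimentSnapshot,
  guardrails: StoppingGuardrails,
): StopDecision {
  if (snapshot.probabilitiesBeatControl.length === 0) {
    throw new Error("experiment has no non-control variants");
  }
  const threshold = guardrails.winnerThreshold / 100;
  const enoughData = snapshot.smallestVariantSample > guardrails.minSampleSizePerVariant;
  const best = Math.max(...snapshot.probabilitiesBeatControl);

  // Probability-based rules are gated on the minimum sample size.
  if (enoughData && best > threshold) return { stop: true, reason: "winner" };
  if (enoughData && best < 1 - threshold) return { stop: true, reason: "no_winner" };

  // The time bound fires regardless of how much data was collected.
  if (snapshot.daysRunning > guardrails.maxDurationDays) {
    return { stop: true, reason: "time_bound" };
  }
  return { stop: false };
}
```

Checking the probability rules before the time bound means an experiment that hits its maximum duration with a clear winner is reported as a win, not as a timeout.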

Appendix C: API Reference

| Method | Endpoint | Auth | Description |
| --- | --- | --- | --- |
| POST | /ab-testing/experiments | Admin | Create experiment |
| GET | /ab-testing/experiments | Admin | List experiments |
| GET | /ab-testing/experiments/:id | Admin | Get experiment details |
| PATCH | /ab-testing/experiments/:id | Admin | Update experiment |
| DELETE | /ab-testing/experiments/:id | Admin | Stop/archive experiment |
| POST | /ab-testing/experiments/:id/start | Admin | Start experiment |
| POST | /ab-testing/experiments/:id/pause | Admin | Pause experiment |
| POST | /ab-testing/experiments/:id/complete | Admin | Complete with winner |
| POST | /ab-testing/assign | Any auth | Get variant assignment for user |
| POST | /ab-testing/events | Any auth | Track experiment event |
| GET | /ab-testing/experiments/:id/results | Admin | Get statistical results |
| GET | /ab-testing/suggestions | Admin | AI-generated experiment ideas |
| POST | /ab-testing/hypotheses | Admin | Generate hypothesis from pattern |

Appendix D: Integration Points

With Feature Flags Module

  • Experiments build on feature flag infrastructure
  • Flag state = variant assignment
  • Consistent bucketing with existing flags

With Telemetry Module

  • Experiment events enriched with telemetry context
  • Automatic metric tracking from existing events
  • Funnel analysis using telemetry breadcrumbs

With Event Bus

| Event | Action |
| --- | --- |
| ab.experiment.started | Notify stakeholders, log audit |
| ab.experiment.completed | Generate report, suggest follow-ups |
| ab.variant.declared_winner | Trigger auto-rollout if enabled |
| ab.early_stopping.triggered | Alert experiment owner |

Appendix E: Cost Estimation

| Component | Monthly Cost (est.) |
| --- | --- |
| Cosmos DB (experiment data) | $100-200 |
| LLM hypothesis generation | $50-100 (weekly reports) |
| Compute (statistical engine) | $50 (negligible) |
| Total | $200-350/month |

Current Status

  • Design complete — 2026-03-03
  • Phase 1: Core Engine — Complete
  • Phase 2: Statistics — Complete
  • Phase 3: AI Hypotheses — Complete
  • Phase 4: Admin UI — Complete
  • Phase 5: Advanced — Future

Estimated Timeline: COMPLETE (Phases 1-4)

Dependencies:

  • Feature flags module (for assignment infrastructure)
  • Telemetry module (for event tracking)
  • Azure OpenAI (for hypothesis generation)

Last Updated: 2026-03-03