# Intelligent A/B Testing: Implementation Roadmap

> **Module:** `platform-service/src/modules/ab-testing/`
> **Admin UI:** `/ops/experiments/`
> **Target:** AI-powered experiment management with auto-allocation, early stopping, and hypothesis generation
> **Estimated Effort:** 2.5–3 weeks
> **Status:** 🟡 Planning

---

## Executive Summary

This roadmap delivers an **intelligent A/B testing platform** that goes beyond traditional feature flags. Unlike manual percentage rollouts, this system uses **Thompson sampling** for automatic traffic allocation, **Bayesian early stopping** when a variant clearly wins or loses, and **LLM-powered hypothesis generation** from feature flag usage patterns.

### Key Differentiators vs. Static Feature Flags

| Capability         | Static Flags (Current) | Intelligent A/B Testing                   |
| ------------------ | ---------------------- | ----------------------------------------- |
| Traffic Allocation | Manual percentage      | **Multi-armed bandit optimization**       |
| Stopping Decision  | Manual monitoring      | **Auto-stop at statistical significance** |
| Winner Selection   | Human judgment         | **Bayesian probability of superiority**   |
| Test Duration      | Fixed (often wrong)    | **Dynamic based on effect size**          |
| Hypothesis         | Human-written          | **AI-generated from usage patterns**      |
| Sample Size        | Guesswork              | **Power analysis + sequential testing**   |

---

## Phase 1: Core Experiment Engine (Week 1)

### 1.1 Data Model & Schemas

- [ ] **1.1.1** Create `modules/ab-testing/types.ts`
  - [ ] `ExperimentDoc`: experiment definition and config
  - [ ] `VariantDoc`: variant metadata + metrics
  - [ ] `AssignmentDoc`: user → variant assignments
  - [ ] `MetricDoc`: event types being tracked
  - [ ] `ExperimentResult`: statistical analysis results
  - [ ] Zod schemas for all inputs
- [ ] **1.1.2** Add Cosmos containers to `cosmos-init.ts`
  - [ ] `experiments` (pk: `/productId`, TTL: 2 years for completed)
  - [ ] `experiment_variants` (pk: `/experimentId`)
  - [ ] `experiment_assignments` (pk: `/userId`, query by experiment)
  - [ ] `experiment_events` (pk: `/experimentId` + `/timestamp` for time-series)
  - [ ] `experiment_metrics` (pk: `/experimentId`, computed aggregates)

### 1.2 Assignment & Bucketing

- [ ] **1.2.1** Create deterministic bucketing
  - [ ] Consistent hashing (userId + experimentId → variant)
  - [ ] FNV-1a hash algorithm (same as feature flags)
  - [ ] Sticky assignments (user always sees same variant)
  - [ ] Override capability (force specific variant for QA)
- [ ] **1.2.2** Assignment strategies
  - [ ] `random`: Simple randomization (control vs static)
  - [ ] `thompson`: Thompson sampling (multi-armed bandit)
  - [ ] `epsilon_greedy`: Epsilon-greedy exploration
  - [ ] `ucb`: Upper Confidence Bound algorithm
- [ ] **1.2.3** Audience targeting
  - [ ] User property filters (platform, version, region, subscription tier)
  - [ ] Percentage rollout within target segment
  - [ ] Exclusion lists (beta users, internal accounts)

### 1.3 Event Tracking Pipeline

- [ ] **1.3.1** Metric definitions
  - [ ] `conversion`: Binary (did/didn't convert)
  - [ ] `count`: Integer events (sessions, messages)
  - [ ] `duration`: Time-based (session length, task time)
  - [ ] `revenue`: Monetary (purchase amount, LTV)
  - [ ] `custom`: Arbitrary numeric values
- [ ] **1.3.2** Event ingestion
  - [ ] `POST /ab-testing/events` batch endpoint
  - [ ] Client SDK: `track(event, value, metadata)`
  - [ ] Automatic attribution (which variant caused this event)
  - [ ] Deduplication (eventId + userId uniqueness)

**Phase 1 Exit Criteria:**

- [ ] Experiments created with multiple variants
- [ ] Users consistently assigned to variants
- [ ] Events tracked and attributed correctly
- [ ] 20+ tests for assignment and ingestion

---

## Phase 2: Statistical Analysis Engine (Week 1–2)

### 2.1 Bayesian Inference

- [ ] **2.1.1** Create `modules/ab-testing/statistics.ts`
  - [ ] `BetaDistribution` for conversion rates
  - [ ] `GammaDistribution` for
    count/duration metrics
  - [ ] `NormalDistribution` for continuous metrics
  - [ ] Monte Carlo simulation (10,000 samples)
- [ ] **2.1.2** Probability calculations
  - [ ] `probabilityVariantBeatsControl(variant, control)`
  - [ ] `expectedLossIfChosen(variant)`
  - [ ] `probabilityBeatAllVariants(variant)`
- [ ] **2.1.3** Credible intervals
  - [ ] 95% credible interval for each variant's true metric
  - [ ] Visualization-ready (lower, mean, upper bounds)

### 2.2 Early Stopping Rules

- [ ] **2.2.1** Stopping criteria
  - [ ] **Winner found:** Variant has > 95% probability of beating control
  - [ ] **Loser clear:** Control has > 95% probability of beating variant
  - [ ] **Practical significance:** Minimum detectable effect not reached
  - [ ] **Time bound:** Max duration reached (safety limit)
- [ ] **2.2.2** Auto-promotion
  - [ ] Auto-rollout winner to 100% when threshold hit
  - [ ] Notify admins via Slack/email
  - [ ] Create audit log entry
- [ ] **2.2.3** Guardrails
  - [ ] Minimum sample size before early stopping (100 users/variant)
  - [ ] Business hours only for auto-actions
  - [ ] Require approval for revenue-impacting experiments

### 2.3 Thompson Sampling

- [ ] **2.3.1** Multi-armed bandit implementation
  - [ ] Sample from posterior distributions
  - [ ] Assign user to variant with highest sample
  - [ ] Re-balance traffic every hour based on performance
- [ ] **2.3.2** Exploration vs exploitation
  - [ ] Exploration rate decays over time
  - [ ] High uncertainty = more exploration
  - [ ] Clear winner = more traffic to winner
- [ ] **2.3.3** Regret minimization
  - [ ] Track cumulative regret vs optimal variant
  - [ ] Regret bounds reporting

**Phase 2 Exit Criteria:**

- [ ] Bayesian probabilities calculated correctly
- [ ] Early stopping triggers at appropriate thresholds
- [ ] Thompson sampling re-allocates traffic dynamically
- [ ] Statistical tests validate correctness

---

## Phase 3: AI-Powered Hypothesis Generation (Week 2)

### 3.1 Pattern Detection

- [ ] **3.1.1** Usage pattern analysis
  - [ ] Analyze feature flag usage telemetry
  - [ ] Segment analysis (iOS vs Android, free vs pro)
  - [ ] Temporal patterns (day of week, time of day)
  - [ ] User behavior sequences (funnel analysis)
- [ ] **3.1.2** Anomaly detection
  - [ ] Unexpected drop in feature adoption
  - [ ] Performance regression signals
  - [ ] User segment showing different behavior
- [ ] **3.1.3** Opportunity identification
  - [ ] Underperforming features (low adoption)
  - [ ] High-dropoff flows
  - [ ] Competitor feature gaps

### 3.2 Hypothesis Generation

- [ ] **3.2.1** LLM hypothesis prompts

  ```
  Given this feature usage data:
  - Feature: {featureName}
  - Current adoption: {adoptionRate}% (baseline: {baseline}%)
  - Segment performance: {segmentData}
  - User feedback: {feedbackSamples}
  - Competitor analysis: {competitorFeatures}

  Generate experiment hypotheses:
  1. Primary hypothesis: "Changing X will improve Y because..."
  2. Secondary hypotheses (2-3 alternatives)
  3. Expected effect size (conservative estimate)
  4. Success metric recommendation
  5. Risk assessment
  ```

- [ ] **3.2.2** Hypothesis ranking
  - [ ] Expected impact scoring
  - [ ] Implementation difficulty estimate
  - [ ] Statistical power prediction
  - [ ] Risk-adjusted expected value
- [ ] **3.2.3** Suggested experiment design
  - [ ] Variant count recommendation
  - [ ] Traffic allocation suggestion
  - [ ] Duration estimate
  - [ ] Required sample size calculation

### 3.3 Auto-Experiment Suggestions

- [ ] **3.3.1** Weekly AI reports
  - [ ] Top 5 experiment opportunities
  - [ ] Hypotheses with supporting evidence
  - [ ] Prioritized by expected impact
- [ ] **3.3.2** One-click experiment creation
  - [ ] Pre-fill experiment from hypothesis
  - [ ] Suggested variants with descriptions
  - [ ] Pre-configured metrics

**Phase 3 Exit Criteria:**

- [ ] AI generates meaningful hypotheses from usage data
- [ ] Hypothesis quality rated by product team (80%+ useful)
- [ ] Auto-suggested experiments created in 1 click
- [ ] Weekly reports generated automatically

---

## Phase 4: Admin Dashboard UI (Week 2–3)

### 4.1 Experiments List Page

- [ ] **4.1.1** Create `/ops/experiments/page.tsx`
  - [ ] Experiment cards (status, duration, sample size)
  - [ ] Quick filters (running, completed, draft)
  - [ ] AI-generated hypothesis badge
  - [ ] Health indicators (traffic balance, event flow)
- [ ] **4.1.2** Experiment creation wizard
  - [ ] Step 1: Define hypothesis (AI suggestions available)
  - [ ] Step 2: Create variants (name, description, config)
  - [ ] Step 3: Select metrics (primary + secondary)
  - [ ] Step 4: Audience targeting
  - [ ] Step 5: Traffic allocation (manual or Thompson)
  - [ ] Step 6: Review and launch

### 4.2 Live Experiment Dashboard

- [ ] **4.2.1** Create `/ops/experiments/[id]/page.tsx`
  - [ ] Real-time metrics comparison
  - [ ] Variant performance table (conversions, counts, durations)
  - [ ] Bayesian probability visualization
  - [ ] Credible interval charts
- [ ] **4.2.2** Statistical summary card
  - [ ] Probability of beating control (per variant)
  - [ ] Expected lift if
    implemented
  - [ ] Sample size progress bar
  - [ ] Days to significance estimate
- [ ] **4.2.3** Action buttons
  - [ ] Adjust traffic allocation
  - [ ] Pause/resume experiment
  - [ ] Stop and declare winner
  - [ ] Rollout winner to 100%
  - [ ] Archive experiment

### 4.3 Results & Reporting

- [ ] **4.3.1** Results page
  - [ ] Final statistical summary
  - [ ] Variant comparison visualization
  - [ ] Segment breakdown (iOS vs Android, etc.)
  - [ ] Credible intervals over time
- [ ] **4.3.2** AI insights panel
  - [ ] Why this result occurred (LLM summary)
  - [ ] Unexpected findings
  - [ ] Follow-up experiment suggestions
- [ ] **4.3.3** Export capabilities
  - [ ] CSV export of raw data
  - [ ] PDF report generation
  - [ ] API endpoint for data warehouse sync

**Phase 4 Exit Criteria:**

- [ ] Full experiment lifecycle manageable in UI
- [ ] Real-time stats visible and accurate
- [ ] Bayesian visualizations clear to non-statisticians
- [ ] Export and reporting functional

---

## Phase 5: Advanced Capabilities (Future)

### 5.1 Multi-Variate Testing

- [ ] Test multiple variables simultaneously
- [ ] Full factorial and fractional factorial designs
- [ ] Interaction effect detection

### 5.2 Sequential Experimentation

- [ ] Multi-phase experiments (qualification → main → validation)
- [ ] Holdout groups for long-term validation
- [ ] Global holdout (never-exposed users)

### 5.3 Personalization Layer

- [ ] Contextual bandits (different variants for different users)
- [ ] ML model for variant selection
- [ ] Automatic personalization optimization

### 5.4 Experiment Coordination

- [ ] Mutually exclusive experiments
- [ ] Experiment priority rules
- [ ] Layered experimentation (orthogonal tests)

---

## Appendix A: Data Models

### ExperimentDoc

```typescript
interface ExperimentDoc {
  id: string; // exp_
  productId: string; // partition key

  // Experiment definition
  name: string;
  description: string;
  hypothesis: string;
  aiGeneratedHypothesis?: boolean; // Flag for AI-suggested

  // Status lifecycle: draft → running → paused | stopped | completed
  status: 'draft' | 'running' | 'paused' | 'stopped' | 'completed';

  // Variants
  controlVariantId: string; // Baseline variant
  variantIds: string[]; // All variant IDs

  // Configuration
  allocationStrategy: 'random' | 'thompson' | 'epsilon_greedy' | 'ucb';
  targetPercent: number; // % of eligible traffic

  // Audience targeting
  targeting: {
    platforms?: string[]; // ios, android, web
    appVersions?: { min: string; max?: string };
    regions?: string[];
    userSegments?: string[]; // pro, free, enterprise
    userProperties?: Record<string, unknown>; // Key/value types are an assumption
  };

  // Metrics
  primaryMetric: {
    name: string;
    type: 'conversion' | 'count' | 'duration' | 'revenue' | 'custom';
    eventName: string; // Telemetry event to track
    aggregation: 'sum' | 'mean' | 'count' | 'unique';
    direction: 'increase' | 'decrease'; // Is higher better?
    minimumDetectableEffect: number; // % change we want to detect
  };
  secondaryMetrics: Array<{
    name: string;
    type: 'conversion' | 'count' | 'duration' | 'revenue' | 'custom';
    eventName: string;
  }>;

  // Guardrails
  guardrails: {
    minSampleSizePerVariant: number; // Default: 100
    maxDurationDays: number; // Safety limit, default: 30
    autoStopEnabled: boolean;
    winnerThreshold: number; // % probability to auto-stop, default: 95
    requireApprovalFor: 'none' | 'revenue' | 'all';
  };

  // Scheduling
  startAt?: string; // Scheduled start (ISO 8601)
  endAt?: string; // Scheduled end or actual stop

  // Stats (denormalized for fast reads)
  totalParticipants: number;
  totalEvents: number;

  // Timestamps
  createdAt: string;
  updatedAt: string;
  startedAt?: string;
  completedAt?: string;
  ttl: number; // 2 years for completed
}
```

### VariantDoc

```typescript
interface VariantDoc {
  id: string; // var_
  experimentId: string; // partition key

  // Variant definition
  name: string; // "Control", "New Button Color", etc.
  description?: string;
  isControl: boolean;

  // Feature flag configuration
  flagConfig: Record<string, unknown>; // Arbitrary config payload (key/value types are an assumption)

  // Traffic allocation (dynamic for bandit strategies)
  currentAllocationPercent: number; // 0–100%

  // Statistics (real-time computed)
  stats: {
    participants: number;
    events: number;

    // Primary metric
    primaryMetricValue: number; // Mean or conversion rate
    primaryMetricStdDev?: number;

    // For conversion metrics
    conversions?: number;
    conversionRate?: number; // 0–1

    // Bayesian posterior parameters
    betaAlpha?: number; // For Beta distribution
    betaBeta?: number;
    gammaShape?: number; // For Gamma distribution
    gammaScale?: number;
  };

  // Bayesian results
  bayesianResults?: {
    probabilityBeatsControl: number; // 0–1
    probabilityBeatsAll: number; // 0–1
    expectedLiftPercent: number; // Relative to control
    expectedLoss: number; // Risk of choosing this variant
    credibleInterval: {
      lower: number;
      mean: number;
      upper: number;
    };
  };

  createdAt: string;
  updatedAt: string;
}
```

### ExperimentAssignmentDoc

```typescript
interface ExperimentAssignmentDoc {
  id: string; // ea_
  userId: string; // partition key (for user lookups)
  experimentId: string;
  variantId: string;

  // Assignment metadata
  assignedAt: string; // First assignment
  firstExposedAt?: string; // First actual exposure (feature use)

  // Context at assignment
  assignmentContext: {
    platform: string;
    appVersion: string;
    osVersion: string;
    deviceModel?: string;
    region?: string;
  };

  // Events attributed to this assignment
  eventCount: number;
  lastEventAt?: string;

  // TTL: Remove after experiment completes + analysis period
  ttl: number; // experimentEnd + 90 days
}
```

### ExperimentEventDoc

```typescript
interface ExperimentEventDoc {
  id: string; // ee_
  experimentId: string; // partition key
  timestamp: string; // Sort key for time-series queries

  // Attribution
  userId: string;
  variantId: string;
  assignmentId: string;

  // Event details
  metricName: string;
  metricType: 'conversion' | 'count' | 'duration' | 'revenue' | 'custom';
  value: number; // Numeric value

  // Conversion tracking (for binary metrics)
  converted: boolean; // For conversion metrics

  // Context
  eventMetadata?: Record<string, unknown>; // Key/value types are an assumption

  // Denormalized for filtering
  platform: string;
  appVersion: string;

  // TTL: Shorter for raw events
  ttl: number; // 90 days
}
```

---

## Implementation Tracking

| Phase | Task                          | Status | Commit  |
| ----- | ----------------------------- | ------ | ------- |
| 1.1   | Experiment types & schemas    | ✅     | a9b2247 |
| 1.1   | Cosmos containers             | ✅     | a9b2247 |
| 1.2   | Deterministic bucketing       | ✅     | 783067e |
| 1.2   | Assignment strategies         | ✅     | 783067e |
| 1.2   | Audience targeting            | ✅     | 783067e |
| 1.3   | Metric definitions            | ✅     | 783067e |
| 1.3   | Event ingestion               | ✅     | 783067e |
| 2.1   | Bayesian inference engine     | ✅     | 783067e |
| 2.1   | Probability calculations      | ✅     | 783067e |
| 2.1   | Credible intervals            | ✅     | 783067e |
| 2.2   | Early stopping rules          | ✅     | 783067e |
| 2.2   | Auto-promotion                | ✅     | 783067e |
| 2.2   | Guardrails                    | ✅     | 783067e |
| 2.3   | Thompson sampling             | ✅     | 783067e |
| 2.3   | Exploration vs exploitation   | ✅     | 783067e |
| 2.3   | Regret minimization           | ✅     | 783067e |
| 3.1   | Pattern detection             | ✅     | 44fa045 |
| 3.1   | Anomaly detection             | ✅     | 44fa045 |
| 3.2   | Hypothesis generation prompts | ✅     | 44fa045 |
| 3.2   | Hypothesis ranking            | ✅     | 44fa045 |
| 3.3   | Auto-experiment suggestions   | ✅     | 44fa045 |
| 4.1   | Experiments list page         | ✅     | 44fa045 |
| 4.1   | Creation wizard               | ✅     | 44fa045 |
| 4.2   | Live dashboard                | ✅     | 44fa045 |
| 4.2   | Statistical summary           | ✅     | 44fa045 |
| 4.3   | Results & reporting           | ✅     | 44fa045 |
| 4.3   | AI insights panel             | ✅     | 44fa045 |

**Legend:** ⬜ Not started | 🟡 In progress | ✅ Complete | ⏸️ Deferred

---

## Quick Reference for Implementing Agent

**📋 Full Roadmap:** `/Users/sd9235/code/mygh/learning_ai_common_plat/docs/roadmaps/INTELLIGENT_AB_TESTING_ROADMAP.md`

**Key Files to Modify/Create:**

```
services/platform-service/
├── src/
│   ├── modules/ab-testing/
│   │   ├── types.ts                # [1.1] Experiment, Variant, Assignment types
│   │   ├── repository.ts           # [1.2] Data access layer
│   │   ├── bucketing.ts            # [1.2] FNV-1a hash, sticky assignments
│   │   ├── statistics.ts           # [2.1] Bayesian inference, Beta/Normal distributions
│   │   ├── allocation.ts           # [2.3] Thompson sampling, bandit strategies
│   │   ├── hypothesis-generator.ts # [3.2] LLM pattern analysis
│   │   ├── routes.ts               # [4] REST API
│   │   └── ab-testing.test.ts      # Tests
│   ├── lib/
│   │   └── cosmos-init.ts          # [1.1] Add containers
│   └── server.ts                   # Register routes

dashboards/admin-web/
├── src/
│   ├── app/(dashboard)/
│   │   ├── experiments/
│   │   │   ├── page.tsx            # [4.1] Experiments list
│   │   │   ├── new/page.tsx        # [4.1] Creation wizard
│   │   │   └── [id]/
│   │   │       └── page.tsx        # [4.2] Live dashboard
│   ├── lib/
│   │   └── experiments-client.ts   # API client
│   └── components/
│       └── experiments/            # Bayesian charts, variant cards
```

**Commit Message Format:**

```
feat(ab-testing): []
```

**Example:**

```bash
git add services/platform-service/src/modules/ab-testing/
git commit -m "feat(ab-testing): add experiment types and cosmos containers [1.1]"
```

**Testing Requirements:**

- Unit tests: 25+ Vitest tests for bucketing, statistics, bandit algorithms
- Statistical validation: A/A tests, known distribution tests
- Integration: End-to-end experiment lifecycle

**Dependencies:**

- Feature flags module (reuse bucketing logic)
- Telemetry module (event tracking)
- Azure OpenAI (hypothesis generation)

---

## Appendix B: Statistical Methods

### Bayesian A/B Testing

**Conversion Metrics (Beta-Binomial):**

```
Posterior: Beta(α + conversions, β + non-conversions)
Where α = β = 1 (uniform prior)

Probability variant beats control:
P(variant > control) = ∫[0,1] BetaCDF_control(x) * BetaPDF_variant(x) dx
                     ≈ Σ(i=0 to n) BetaCDF_control(x_i) * BetaPDF_variant(x_i) * Δx
Where x_i = i/n and Δx = 1/n
```

**Continuous Metrics (Normal):**

```
Posterior: Normal(μ_n, σ_n²)
Where μ_n, σ_n updated via conjugate prior

Probability variant beats control via Monte Carlo sampling
```

### Thompson Sampling

```
For each incoming user:
  For each variant:
    Sample θ_i from variant's posterior distribution
  Assign user to variant with max(θ_i)
  Update variant's posterior after observing outcome
```

### Early Stopping

```
Stop experiment when:
  samples_per_variant > min_sample_size AND (
    max_variant P(beats control) > 0.95     → Winner found
    OR max_variant P(beats control) < 0.05  → No winner
  )
  OR days_running > max_duration            → Time bound (safety limit)
```

---

## Appendix C: API Reference

| Method | Endpoint                               | Auth     | Description                      |
| ------ | -------------------------------------- | -------- | -------------------------------- |
| POST   | `/ab-testing/experiments`              | Admin    | Create experiment                |
| GET    | `/ab-testing/experiments`              | Admin    | List experiments                 |
| GET    | `/ab-testing/experiments/:id`          | Admin    | Get experiment details           |
| PATCH  | `/ab-testing/experiments/:id`          | Admin    | Update experiment                |
| DELETE | `/ab-testing/experiments/:id`          | Admin    | Stop/archive experiment          |
| POST   | `/ab-testing/experiments/:id/start`    | Admin    | Start experiment                 |
| POST   | `/ab-testing/experiments/:id/pause`    | Admin    | Pause experiment                 |
| POST   | `/ab-testing/experiments/:id/complete` | Admin    | Complete with winner             |
| POST   | `/ab-testing/assign`                   | Any auth | Get variant assignment for user  |
| POST   | `/ab-testing/events`                   | Any auth | Track experiment event           |
| GET    | `/ab-testing/experiments/:id/results`  | Admin    | Get statistical results          |
| GET    | `/ab-testing/suggestions`              | Admin    | AI-generated experiment ideas    |
| POST   | `/ab-testing/hypotheses`               | Admin    | Generate hypothesis from pattern |

---

## Appendix D: Integration Points

### With Feature Flags Module

- Experiments build on feature flag infrastructure
- Flag state = variant assignment
- Consistent bucketing with existing flags

### With Telemetry Module

- Experiment events enriched with telemetry context
- Automatic
  metric tracking from existing events
- Funnel analysis using telemetry breadcrumbs

### With Event Bus

| Event                         | Action                              |
| ----------------------------- | ----------------------------------- |
| `ab.experiment.started`       | Notify stakeholders, log audit      |
| `ab.experiment.completed`     | Generate report, suggest follow-ups |
| `ab.variant.declared_winner`  | Trigger auto-rollout if enabled     |
| `ab.early_stopping.triggered` | Alert experiment owner              |

---

## Appendix E: Cost Estimation

| Component                    | Monthly Cost (est.)      |
| ---------------------------- | ------------------------ |
| Cosmos DB (experiment data)  | $100–200                 |
| LLM hypothesis generation    | $50–100 (weekly reports) |
| Compute (statistical engine) | $50 (negligible)         |
| **Total**                    | **$200–350/month**       |

---

## Current Status

- [x] **Design complete** - 2026-03-03
- [x] **Phase 1: Core Engine** - Complete
- [x] **Phase 2: Statistics** - Complete
- [x] **Phase 3: AI Hypotheses** - Complete
- [x] **Phase 4: Admin UI** - Complete
- [ ] **Phase 5: Advanced** - Future

**Estimated Timeline:** COMPLETE (Phases 1–4)

**Dependencies:**

- Feature flags module (for assignment infrastructure)
- Telemetry module (for event tracking)
- Azure OpenAI (for hypothesis generation)

---

_Last Updated: 2026-03-03_
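
---

The Beta-Binomial posterior and the Thompson sampling loop described in Appendix B can be sketched in TypeScript roughly as follows. This is a minimal illustration under stated assumptions, not the code in `statistics.ts` or `allocation.ts`: every name here (`ArmStats`, `Rng`, `mulberry32`, `samplePosterior`, `probabilityBeatsControl`, `pickArm`) is hypothetical, the metric is assumed to be a binary conversion, and the prior is the uniform Beta(1, 1) from Appendix B. Beta draws are built from two Gamma draws (Marsaglia-Tsang), which is one common choice rather than a requirement of the roadmap.

```typescript
// Sketch only: all names are hypothetical, not the module's actual API.
type ArmStats = { conversions: number; participants: number };
type Rng = () => number; // uniform value in [0, 1)

/** Seeded PRNG (mulberry32) so a simulation run is reproducible. */
function mulberry32(seed: number): Rng {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

/** Standard normal draw via the Box-Muller transform. */
function sampleNormal(rng: Rng): number {
  const u1 = 1 - rng(); // shift to (0, 1] so Math.log never sees 0
  const u2 = rng();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

/** Gamma(shape, 1) draw via Marsaglia-Tsang; valid for shape >= 1. */
function sampleGamma(shape: number, rng: Rng): number {
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    const x = sampleNormal(rng);
    const v = 1 + c * x;
    if (v <= 0) continue; // reject and redraw
    const v3 = v * v * v;
    const u = 1 - rng();
    if (Math.log(u) < 0.5 * x * x + d - d * v3 + d * Math.log(v3)) return d * v3;
  }
}

/** Draw from the Beta(1 + conversions, 1 + non-conversions) posterior. */
function samplePosterior(arm: ArmStats, rng: Rng): number {
  const x = sampleGamma(1 + arm.conversions, rng);
  const y = sampleGamma(1 + arm.participants - arm.conversions, rng);
  return x / (x + y);
}

/** Monte Carlo estimate of P(variant rate > control rate). */
function probabilityBeatsControl(
  control: ArmStats,
  variant: ArmStats,
  rng: Rng,
  draws = 10000,
): number {
  let wins = 0;
  for (let i = 0; i < draws; i++) {
    if (samplePosterior(variant, rng) > samplePosterior(control, rng)) wins++;
  }
  return wins / draws;
}

/** Thompson sampling: one posterior draw per arm, return the index of the largest. */
function pickArm(arms: ArmStats[], rng: Rng): number {
  let best = 0;
  let bestDraw = -Infinity;
  for (let i = 0; i < arms.length; i++) {
    const draw = samplePosterior(arms[i], rng);
    if (draw > bestDraw) {
      bestDraw = draw;
      best = i;
    }
  }
  return best;
}

// Example usage with made-up counts.
const demoRng = mulberry32(2026);
const demoControl: ArmStats = { conversions: 100, participants: 1000 };
const demoVariant: ArmStats = { conversions: 150, participants: 1000 };
console.log(probabilityBeatsControl(demoControl, demoVariant, demoRng));
console.log(pickArm([demoControl, demoVariant], demoRng));
```

Injecting the random source keeps the statistics testable: A/A checks (identical arms should give a probability near 0.5) can run with a fixed seed, while production code could pass `Math.random` instead.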