Intelligent A/B Testing — Implementation Roadmap
Module: platform-service/src/modules/ab-testing/
Admin UI: /ops/experiments/
Target: AI-powered experiment management with auto-allocation, early stopping, and hypothesis generation
Estimated Effort: 2.5–3 weeks
Status: ✅ Phases 1–4 complete (Phase 5 deferred to future work)
Executive Summary
This roadmap delivers an intelligent A/B testing platform that goes beyond traditional feature flags. Instead of manual percentage rollouts, the system uses **Thompson sampling** for automatic traffic allocation, Bayesian early stopping when a variant clearly wins or loses, and LLM-powered hypothesis generation from feature flag usage patterns.
Key Differentiators vs. Static Feature Flags
| Capability | Static Flags (Current) | Intelligent A/B Testing |
|---|---|---|
| Traffic Allocation | Manual percentage | Multi-armed bandit optimization |
| Stopping Decision | Manual monitoring | Auto-stop at statistical significance |
| Winner Selection | Human judgment | Bayesian probability of superiority |
| Test Duration | Fixed (often wrong) | Dynamic based on effect size |
| Hypothesis | Human-written | AI-generated from usage patterns |
| Sample Size | Guesswork | Power analysis + sequential testing |
Phase 1: Core Experiment Engine (Week 1)
1.1 Data Model & Schemas
- 1.1.1 Create modules/ab-testing/types.ts
- ExperimentDoc: experiment definition and config
- VariantDoc: variant metadata + metrics
- AssignmentDoc: user → variant assignments
- MetricDoc: event types being tracked
- ExperimentResult: statistical analysis results
- Zod schemas for all inputs
- 1.1.2 Add Cosmos containers to cosmos-init.ts
- experiments (pk: /productId, TTL: 2 years for completed)
- experiment_variants (pk: /experimentId)
- experiment_assignments (pk: /userId, query by experiment)
- experiment_events (pk: /experimentId + /timestamp for time-series)
- experiment_metrics (pk: /experimentId, computed aggregates)
1.2 Assignment & Bucketing
- 1.2.1 Create deterministic bucketing
- Consistent hashing (userId + experimentId → variant)
- FNV-1a hash algorithm (same as feature flags)
- Sticky assignments (user always sees same variant)
- Override capability (force specific variant for QA)
- 1.2.2 Assignment strategies
- random: Simple randomization (control vs static)
- thompson: Thompson sampling (multi-armed bandit)
- epsilon_greedy: Epsilon-greedy exploration
- ucb: Upper Confidence Bound algorithm
- 1.2.3 Audience targeting
- User property filters (platform, version, region, subscription tier)
- Percentage rollout within target segment
- Exclusion lists (beta users, internal accounts)
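The deterministic bucketing described in 1.2.1 can be sketched as follows. This is a minimal illustration, not the module's actual `bucketing.ts`: the names `fnv1a32`, `assignVariant`, and `VariantWeight`, the `experimentId:userId` key format, and the 10,000-bucket resolution are all assumptions. The hash runs over UTF-16 code units, which matches byte-wise FNV-1a only for ASCII input.

```typescript
// FNV-1a (32-bit). Hashing UTF-16 code units instead of UTF-8 bytes is a
// simplification for this sketch; results match byte-wise FNV-1a for ASCII.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // convert to unsigned 32-bit
}

// Hypothetical shape: one entry per variant, percents summing to 100.
interface VariantWeight {
  variantId: string;
  percent: number;
}

// Maps (userId, experimentId) to a stable bucket in [0, 10000) and walks the
// cumulative weights. The same inputs always yield the same variant, which
// gives sticky assignments without storing state.
function assignVariant(
  userId: string,
  experimentId: string,
  weights: VariantWeight[],
  override?: string, // QA override: force a specific variant
): string | null {
  if (override !== undefined) return override;
  const bucket = fnv1a32(`${experimentId}:${userId}`) % 10000;
  let cumulative = 0;
  for (const w of weights) {
    cumulative += w.percent * 100;
    if (bucket < cumulative) return w.variantId;
  }
  return null; // weights did not cover this bucket (user not in experiment)
}
```

Including the experiment ID in the hash key keeps bucket positions uncorrelated across experiments, so a user who lands in the control group of one experiment is not systematically placed in control elsewhere.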
1.3 Event Tracking Pipeline
- 1.3.1 Metric definitions
- conversion: Binary (did/didn't convert)
- count: Integer events (sessions, messages)
- duration: Time-based (session length, task time)
- revenue: Monetary (purchase amount, LTV)
- custom: Arbitrary numeric values
- 1.3.2 Event ingestion
- POST /ab-testing/events batch endpoint
- Client SDK: track(event, value, metadata)
- Automatic attribution (which variant caused this event)
- Deduplication (eventId + userId uniqueness)
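The deduplication rule in 1.3.2 (uniqueness on eventId + userId) might look like the sketch below. The `TrackedEvent` shape and the `dedupeEvents` helper are hypothetical, and the in-memory `Set` stands in for whatever persistent uniqueness check the ingestion path actually uses.

```typescript
// Hypothetical event shape for illustration only.
interface TrackedEvent {
  eventId: string;
  userId: string;
  metricName: string;
  value: number;
}

// Drops events whose (userId, eventId) pair was already seen, either earlier
// in the same batch or in a previous batch recorded in `seen`.
function dedupeEvents(batch: TrackedEvent[], seen: Set<string>): TrackedEvent[] {
  const accepted: TrackedEvent[] = [];
  for (const event of batch) {
    // JSON-encode the pair so that IDs containing separators cannot collide.
    const key = JSON.stringify([event.userId, event.eventId]);
    if (seen.has(key)) continue;
    seen.add(key);
    accepted.push(event);
  }
  return accepted;
}
```

Retries from a client SDK typically resend the whole batch, so checking both within the batch and against earlier batches keeps a retried request from double-counting conversions.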
Phase 1 Exit Criteria:
- Experiments created with multiple variants
- Users consistently assigned to variants
- Events tracked and attributed correctly
- 20+ tests for assignment and ingestion
Phase 2: Statistical Analysis Engine (Week 1–2)
2.1 Bayesian Inference
- 2.1.1 Create modules/ab-testing/statistics.ts
- BetaDistribution for conversion rates
- GammaDistribution for count/duration metrics
- NormalDistribution for continuous metrics
- Monte Carlo simulation (10,000 samples)
- 2.1.2 Probability calculations
- probabilityVariantBeatsControl(variant, control)
- expectedLossIfChosen(variant)
- probabilityBeatAllVariants(variant)
- 2.1.3 Credible intervals
- 95% credible interval for each variant's true metric
- Visualization-ready (lower, mean, upper bounds)
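One way to produce the visualization-ready bounds in 2.1.3 is to take empirical quantiles of posterior samples. The sketch below assumes the samples already exist (for example, from the Monte Carlo step); `credibleInterval` and its index-based quantile rule are illustrative choices, not the module's confirmed implementation.

```typescript
interface CredibleInterval {
  lower: number;
  mean: number;
  upper: number;
}

// Equal-tailed credible interval from posterior samples: cut (1 - level) / 2
// of the probability mass from each tail of the sorted samples.
function credibleInterval(samples: number[], level = 0.95): CredibleInterval {
  if (samples.length === 0) throw new Error("need at least one sample");
  const sorted = [...samples].sort((a, b) => a - b);
  const tail = (1 - level) / 2;
  const quantile = (q: number): number => {
    const idx = Math.min(sorted.length - 1, Math.max(0, Math.floor(q * sorted.length)));
    return sorted[idx] ?? NaN; // idx is always in range; fallback satisfies strict indexing
  };
  const mean = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
  return { lower: quantile(tail), mean, upper: quantile(1 - tail) };
}
```

With 10,000 samples the simple floor-based index is accurate enough for charting; interpolating between neighbours only matters for small sample counts.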
2.2 Early Stopping Rules
- 2.2.1 Stopping criteria
- Winner found: Variant has > 95% probability of beating control
- Loser clear: Control has > 95% probability of beating variant
- Practical significance: Minimum detectable effect not reached
- Time bound: Max duration reached (safety limit)
- 2.2.2 Auto-promotion
- Auto-rollout winner to 100% when threshold hit
- Notify admins via Slack/email
- Create audit log entry
- 2.2.3 Guardrails
- Minimum sample size before early stopping (100 users/variant)
- Business hours only for auto-actions
- Require approval for revenue-impacting experiments
2.3 Thompson Sampling
- 2.3.1 Multi-armed bandit implementation
- Sample from posterior distributions
- Assign user to variant with highest sample
- Re-balance traffic every hour based on performance
- 2.3.2 Exploration vs exploitation
- Exploration rate decays over time
- High uncertainty = more exploration
- Clear winner = more traffic to winner
- 2.3.3 Regret minimization
- Track cumulative regret vs optimal variant
- Regret bounds reporting
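The cumulative regret in 2.3.3 can be illustrated with a short sketch. It assumes the per-variant rates are known, which only holds in simulations or A/A-style validation; in production, posterior means could stand in for them (an assumption, not something the roadmap specifies). The name `cumulativeRegret` is hypothetical.

```typescript
// Expected (pseudo-)regret: for each assignment, add the gap between the best
// variant's rate and the assigned variant's rate. Returns the running total
// after each assignment, suitable for plotting a regret curve.
function cumulativeRegret(
  rates: Record<string, number>, // variantId -> conversion rate
  assignments: string[], // variantId chosen for each user, in order
): number[] {
  const best = Math.max(...Object.values(rates));
  const curve: number[] = [];
  let total = 0;
  for (const variantId of assignments) {
    const rate = rates[variantId];
    if (rate === undefined) throw new Error(`unknown variant: ${variantId}`);
    total += best - rate;
    curve.push(total);
  }
  return curve;
}
```

A bandit that is learning well produces a curve that flattens over time, because assignments to inferior variants become rare.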
Phase 2 Exit Criteria:
- Bayesian probabilities calculated correctly
- Early stopping triggers at appropriate thresholds
- Thompson sampling re-allocates traffic dynamically
- Statistical tests validate correctness
Phase 3: AI-Powered Hypothesis Generation (Week 2)
3.1 Pattern Detection
- 3.1.1 Usage pattern analysis
- Analyze feature flag usage telemetry
- Segment analysis (iOS vs Android, free vs pro)
- Temporal patterns (day of week, time of day)
- User behavior sequences (funnel analysis)
- 3.1.2 Anomaly detection
- Unexpected drop in feature adoption
- Performance regression signals
- User segment showing different behavior
- 3.1.3 Opportunity identification
- Underperforming features (low adoption)
- High-dropoff flows
- Competitor feature gaps
3.2 Hypothesis Generation
- 3.2.1 LLM hypothesis prompts
Given this feature usage data:
- Feature: {featureName}
- Current adoption: {adoptionRate}% (baseline: {baseline}%)
- Segment performance: {segmentData}
- User feedback: {feedbackSamples}
- Competitor analysis: {competitorFeatures}
Generate experiment hypotheses:
1. Primary hypothesis: "Changing X will improve Y because..."
2. Secondary hypotheses (2-3 alternatives)
3. Expected effect size (conservative estimate)
4. Success metric recommendation
5. Risk assessment
- 3.2.2 Hypothesis ranking
- Expected impact scoring
- Implementation difficulty estimate
- Statistical power prediction
- Risk-adjusted expected value
- 3.2.3 Suggested experiment design
- Variant count recommendation
- Traffic allocation suggestion
- Duration estimate
- Required sample size calculation
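For the required sample size calculation in 3.2.3, a common starting point is the fixed-horizon two-proportion approximation, n = (z_α + z_β)² · (p₁(1 − p₁) + p₂(1 − p₂)) / (p₂ − p₁)² per variant. The sketch below uses that textbook formula; it is not confirmed to be what the module uses, and `requiredSampleSizePerVariant` and its default z-values (two-sided α = 0.05, power = 0.80) are assumptions.

```typescript
// Per-variant sample size to detect a relative lift on a conversion metric.
// zAlpha = 1.959964 corresponds to two-sided alpha = 0.05;
// zBeta = 0.841621 corresponds to 80% power.
function requiredSampleSizePerVariant(
  baselineRate: number, // e.g. 0.10 for 10% conversion
  relativeMde: number, // e.g. 0.20 for a 20% relative lift
  zAlpha = 1.959964,
  zBeta = 0.841621,
): number {
  if (baselineRate <= 0 || baselineRate >= 1) throw new Error("baselineRate must be in (0, 1)");
  if (relativeMde <= 0) throw new Error("relativeMde must be positive");
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeMde);
  if (p2 >= 1) throw new Error("target rate must stay below 1");
  const varianceSum = p1 * (1 - p1) + p2 * (1 - p2);
  const delta = p2 - p1;
  return Math.ceil(((zAlpha + zBeta) ** 2 * varianceSum) / (delta * delta));
}
```

For a 10% baseline and a 20% relative lift this gives roughly 3,800 users per variant. Halving the detectable effect roughly quadruples the requirement, which is why the duration estimate depends so strongly on the expected effect size.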
3.3 Auto-Experiment Suggestions
- 3.3.1 Weekly AI reports
- Top 5 experiment opportunities
- Hypotheses with supporting evidence
- Prioritized by expected impact
- 3.3.2 One-click experiment creation
- Pre-fill experiment from hypothesis
- Suggested variants with descriptions
- Pre-configured metrics
Phase 3 Exit Criteria:
- AI generates meaningful hypotheses from usage data
- Hypothesis quality rated by product team (80%+ useful)
- Auto-suggested experiments created in 1 click
- Weekly reports generated automatically
Phase 4: Admin Dashboard UI (Week 2–3)
4.1 Experiments List Page
- 4.1.1 Create /ops/experiments/page.tsx
- Experiment cards (status, duration, sample size)
- Quick filters (running, completed, draft)
- AI-generated hypothesis badge
- Health indicators (traffic balance, event flow)
- 4.1.2 Experiment creation wizard
- Step 1: Define hypothesis (AI suggestions available)
- Step 2: Create variants (name, description, config)
- Step 3: Select metrics (primary + secondary)
- Step 4: Audience targeting
- Step 5: Traffic allocation (manual or Thompson)
- Step 6: Review and launch
4.2 Live Experiment Dashboard
- 4.2.1 Create /ops/experiments/[id]/page.tsx
- Real-time metrics comparison
- Variant performance table (conversions, counts, durations)
- Bayesian probability visualization
- Credible interval charts
- 4.2.2 Statistical summary card
- Probability of beating control (per variant)
- Expected lift if implemented
- Sample size progress bar
- Days to significance estimate
- 4.2.3 Action buttons
- Adjust traffic allocation
- Pause/resume experiment
- Stop and declare winner
- Rollout winner to 100%
- Archive experiment
4.3 Results & Reporting
- 4.3.1 Results page
- Final statistical summary
- Variant comparison visualization
- Segment breakdown (iOS vs Android, etc.)
- Credible intervals over time
- 4.3.2 AI insights panel
- Why this result occurred (LLM summary)
- Unexpected findings
- Follow-up experiment suggestions
- 4.3.3 Export capabilities
- CSV export of raw data
- PDF report generation
- API endpoint for data warehouse sync
Phase 4 Exit Criteria:
- Full experiment lifecycle manageable in UI
- Real-time stats visible and accurate
- Bayesian visualizations clear to non-statisticians
- Export and reporting functional
Phase 5: Advanced Capabilities (Future)
5.1 Multi-Variate Testing
- Test multiple variables simultaneously
- Full factorial and fractional factorial designs
- Interaction effect detection
5.2 Sequential Experimentation
- Multi-phase experiments (qualification → main → validation)
- Holdout groups for long-term validation
- Global holdout (never-exposed users)
5.3 Personalization Layer
- Contextual bandits (different variants for different users)
- ML model for variant selection
- Automatic personalization optimization
5.4 Experiment Coordination
- Mutually exclusive experiments
- Experiment priority rules
- Layered experimentation (orthogonal tests)
Appendix A: Data Models
ExperimentDoc
interface ExperimentDoc {
  id: string; // exp_<uuid>
  productId: string; // partition key

  // Experiment definition
  name: string;
  description: string;
  hypothesis: string;
  aiGeneratedHypothesis?: boolean; // Flag for AI-suggested

  // Status lifecycle: draft → running → paused | stopped | completed
  status: 'draft' | 'running' | 'paused' | 'stopped' | 'completed';

  // Variants
  controlVariantId: string; // Baseline variant
  variantIds: string[]; // All variant IDs

  // Configuration
  allocationStrategy: 'random' | 'thompson' | 'epsilon_greedy' | 'ucb';
  targetPercent: number; // % of eligible traffic

  // Audience targeting
  targeting: {
    platforms?: string[]; // ios, android, web
    appVersions?: { min: string; max?: string };
    regions?: string[];
    userSegments?: string[]; // pro, free, enterprise
    userProperties?: Record<string, string | number | boolean>;
  };

  // Metrics
  primaryMetric: {
    name: string;
    type: 'conversion' | 'count' | 'duration' | 'revenue' | 'custom';
    eventName: string; // Telemetry event to track
    aggregation: 'sum' | 'mean' | 'count' | 'unique';
    direction: 'increase' | 'decrease'; // Is higher better?
    minimumDetectableEffect: number; // % change we want to detect
  };
  secondaryMetrics: Array<{
    name: string;
    type: 'conversion' | 'count' | 'duration' | 'revenue' | 'custom';
    eventName: string;
  }>;

  // Guardrails
  guardrails: {
    minSampleSizePerVariant: number; // Default: 100
    maxDurationDays: number; // Safety limit, default: 30
    autoStopEnabled: boolean;
    winnerThreshold: number; // % probability to auto-stop, default: 95
    requireApprovalFor: 'none' | 'revenue' | 'all';
  };

  // Scheduling
  startAt?: string; // Scheduled start (ISO 8601)
  endAt?: string; // Scheduled end or actual stop

  // Stats (denormalized for fast reads)
  totalParticipants: number;
  totalEvents: number;

  // Timestamps
  createdAt: string;
  updatedAt: string;
  startedAt?: string;
  completedAt?: string;
  ttl: number; // 2 years for completed
}
VariantDoc
interface VariantDoc {
  id: string; // var_<uuid>
  experimentId: string; // partition key

  // Variant definition
  name: string; // "Control", "New Button Color", etc.
  description?: string;
  isControl: boolean;

  // Feature flag configuration
  flagConfig: Record<string, unknown>; // Arbitrary config payload

  // Traffic allocation (dynamic for bandit strategies)
  currentAllocationPercent: number; // 0–100%

  // Statistics (real-time computed)
  stats: {
    participants: number;
    events: number;
    // Primary metric
    primaryMetricValue: number; // Mean or conversion rate
    primaryMetricStdDev?: number;
    // For conversion metrics
    conversions?: number;
    conversionRate?: number; // 0–1
    // Bayesian posterior parameters
    betaAlpha?: number; // For Beta distribution
    betaBeta?: number;
    gammaShape?: number; // For Gamma distribution
    gammaScale?: number;
  };

  // Bayesian results
  bayesianResults?: {
    probabilityBeatsControl: number; // 0–1
    probabilityBeatsAll: number; // 0–1
    expectedLiftPercent: number; // Relative to control
    expectedLoss: number; // Risk of choosing this variant
    credibleInterval: {
      lower: number;
      mean: number;
      upper: number;
    };
  };

  createdAt: string;
  updatedAt: string;
}
ExperimentAssignmentDoc
interface ExperimentAssignmentDoc {
  id: string; // ea_<uuid>
  userId: string; // partition key (for user lookups)
  experimentId: string;
  variantId: string;

  // Assignment metadata
  assignedAt: string; // First assignment
  firstExposedAt?: string; // First actual exposure (feature use)

  // Context at assignment
  assignmentContext: {
    platform: string;
    appVersion: string;
    osVersion: string;
    deviceModel?: string;
    region?: string;
  };

  // Events attributed to this assignment
  eventCount: number;
  lastEventAt?: string;

  // TTL: Remove after experiment completes + analysis period
  ttl: number; // experimentEnd + 90 days
}
ExperimentEventDoc
interface ExperimentEventDoc {
  id: string; // ee_<uuid>
  experimentId: string; // partition key
  timestamp: string; // Sort key for time-series queries

  // Attribution
  userId: string;
  variantId: string;
  assignmentId: string;

  // Event details
  metricName: string;
  metricType: 'conversion' | 'count' | 'duration' | 'revenue' | 'custom';
  value: number; // Numeric value

  // Conversion tracking (for binary metrics)
  converted: boolean; // For conversion metrics

  // Context
  eventMetadata?: Record<string, unknown>;

  // Denormalized for filtering
  platform: string;
  appVersion: string;

  // TTL: Shorter for raw events
  ttl: number; // 90 days
}
Implementation Tracking
| Phase | Task | Status | Commit |
|---|---|---|---|
| 1.1 | Experiment types & schemas | ✅ | a9b2247 |
| 1.1 | Cosmos containers | ✅ | a9b2247 |
| 1.2 | Deterministic bucketing | ✅ | 783067e |
| 1.2 | Assignment strategies | ✅ | 783067e |
| 1.2 | Audience targeting | ✅ | 783067e |
| 1.3 | Metric definitions | ✅ | 783067e |
| 1.3 | Event ingestion | ✅ | 783067e |
| 2.1 | Bayesian inference engine | ✅ | 783067e |
| 2.1 | Probability calculations | ✅ | 783067e |
| 2.1 | Credible intervals | ✅ | 783067e |
| 2.2 | Early stopping rules | ✅ | 783067e |
| 2.2 | Auto-promotion | ✅ | 783067e |
| 2.2 | Guardrails | ✅ | 783067e |
| 2.3 | Thompson sampling | ✅ | 783067e |
| 2.3 | Exploration vs exploitation | ✅ | 783067e |
| 2.3 | Regret minimization | ✅ | 783067e |
| 3.1 | Pattern detection | ✅ | 44fa045 |
| 3.1 | Anomaly detection | ✅ | 44fa045 |
| 3.2 | Hypothesis generation prompts | ✅ | 44fa045 |
| 3.2 | Hypothesis ranking | ✅ | 44fa045 |
| 3.3 | Auto-experiment suggestions | ✅ | 44fa045 |
| 4.1 | Experiments list page | ✅ | 44fa045 |
| 4.1 | Creation wizard | ✅ | 44fa045 |
| 4.2 | Live dashboard | ✅ | 44fa045 |
| 4.2 | Statistical summary | ✅ | 44fa045 |
| 4.3 | Results & reporting | ✅ | 44fa045 |
| 4.3 | AI insights panel | ✅ | 44fa045 |
Legend: ⬜ Not started | 🟡 In progress | ✅ Complete | ⏸️ Deferred
Quick Reference for Implementing Agent
📋 Full Roadmap: /Users/sd9235/code/mygh/learning_ai_common_plat/docs/roadmaps/INTELLIGENT_AB_TESTING_ROADMAP.md
Key Files to Modify/Create:
services/platform-service/
├── src/
│   ├── modules/ab-testing/
│   │   ├── types.ts                 # [1.1] Experiment, Variant, Assignment types
│   │   ├── repository.ts            # [1.2] Data access layer
│   │   ├── bucketing.ts             # [1.2] FNV-1a hash, sticky assignments
│   │   ├── statistics.ts            # [2.1] Bayesian inference, Beta/Normal distributions
│   │   ├── allocation.ts            # [2.3] Thompson sampling, bandit strategies
│   │   ├── hypothesis-generator.ts  # [3.2] LLM pattern analysis
│   │   ├── routes.ts                # [4] REST API
│   │   └── ab-testing.test.ts       # Tests
│   ├── lib/
│   │   └── cosmos-init.ts           # [1.1] Add containers
│   └── server.ts                    # Register routes
dashboards/admin-web/
├── src/
│   ├── app/(dashboard)/
│   │   ├── experiments/
│   │   │   ├── page.tsx             # [4.1] Experiments list
│   │   │   ├── new/page.tsx         # [4.1] Creation wizard
│   │   │   └── [id]/
│   │   │       └── page.tsx         # [4.2] Live dashboard
│   ├── lib/
│   │   └── experiments-client.ts    # API client
│   └── components/
│       └── experiments/             # Bayesian charts, variant cards
Commit Message Format:
feat(ab-testing): <description> [<task.code>]
Example:
git add services/platform-service/src/modules/ab-testing/
git commit -m "feat(ab-testing): add experiment types and cosmos containers [1.1]"
Testing Requirements:
- Unit tests: 25+ Vitest tests for bucketing, statistics, bandit algorithms
- Statistical validation: A/A tests, known distribution tests
- Integration: End-to-end experiment lifecycle
Dependencies:
- Feature flags module (reuse bucketing logic)
- Telemetry module (event tracking)
- Azure OpenAI (hypothesis generation)
Appendix B: Statistical Methods
Bayesian A/B Testing
Conversion Metrics (Beta-Binomial):
Posterior: Beta(α + conversions, β + non-conversions)
Where α = β = 1 (uniform prior)
Probability variant beats control:
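P(variant > control) = ∫₀¹ BetaCDF_control(x) · BetaPDF_variant(x) dx
(the integral runs over conversion rates x in [0, 1]; in practice it is estimated by Monte Carlo sampling from both posteriors)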
Continuous Metrics (Normal):
Posterior: Normal(μ_n, σ_n²)
Where μ_n, σ_n updated via conjugate prior
Probability variant beats control via Monte Carlo sampling
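The beta-binomial comparison above can be estimated by Monte Carlo sampling, as in this sketch. It draws Beta samples via two Gamma draws (Marsaglia-Tsang method, valid here because the uniform prior keeps both shape parameters at 1 or more) and counts how often the variant's draw exceeds the control's. The function name mirrors `probabilityVariantBeatsControl` from 2.1.2, but the signature, the `ArmCounts` shape, and the seeded `mulberry32` generator are assumptions made for a reproducible illustration.

```typescript
type Rng = () => number;

// Small seeded PRNG so that results are reproducible.
function mulberry32(seed: number): Rng {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Standard normal draw via the Box-Muller transform.
function sampleNormal(rng: Rng): number {
  const u = 1 - rng(); // in (0, 1], avoids log(0)
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * rng());
}

// Gamma(shape, 1) draw via Marsaglia-Tsang; requires shape >= 1.
function sampleGamma(shape: number, rng: Rng): number {
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    const x = sampleNormal(rng);
    const v = Math.pow(1 + c * x, 3);
    if (v <= 0) continue;
    const u = 1 - rng();
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v;
  }
}

// Beta(alpha, beta) draw as a ratio of two Gamma draws.
function sampleBeta(alpha: number, beta: number, rng: Rng): number {
  const x = sampleGamma(alpha, rng);
  const y = sampleGamma(beta, rng);
  return x / (x + y);
}

interface ArmCounts {
  participants: number;
  conversions: number;
}

// Fraction of joint posterior draws in which the variant's rate is higher.
// Posterior per arm: Beta(1 + conversions, 1 + non-conversions).
function probabilityVariantBeatsControl(
  variant: ArmCounts,
  control: ArmCounts,
  draws = 10000,
  rng: Rng = mulberry32(42),
): number {
  let wins = 0;
  for (let i = 0; i < draws; i++) {
    const v = sampleBeta(1 + variant.conversions, 1 + variant.participants - variant.conversions, rng);
    const c = sampleBeta(1 + control.conversions, 1 + control.participants - control.conversions, rng);
    if (v > c) wins++;
  }
  return wins / draws;
}
```

With 10,000 draws the Monte Carlo error on the estimate is at most about half a percentage point, which is small relative to a 95% decision threshold.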
Thompson Sampling
For each incoming user:
For each variant:
Sample θ_i from variant's posterior distribution
Assign user to variant with max(θ_i)
Update variant's posterior after observing outcome
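The loop above can be sketched with the posterior sampler injected as a function, which keeps the selection logic independent of any particular distribution and makes it easy to test. `Arm`, `chooseVariant`, and `updateArm` are hypothetical names, and the Beta parameters assume a conversion metric.

```typescript
// Beta posterior parameters for one variant (conversion metric assumed).
interface Arm {
  variantId: string;
  alpha: number; // 1 + conversions
  beta: number; // 1 + non-conversions
}

// Draws one value from an arm's posterior. In production this would be a
// Beta sampler; tests can pass a deterministic stub.
type PosteriorSampler = (arm: Arm) => number;

// Thompson step: sample every arm once and pick the arm with the largest draw.
function chooseVariant(arms: Arm[], sample: PosteriorSampler): Arm {
  let best: Arm | undefined;
  let bestDraw = -Infinity;
  for (const arm of arms) {
    const draw = sample(arm);
    if (draw > bestDraw) {
      best = arm;
      bestDraw = draw;
    }
  }
  if (best === undefined) throw new Error("at least one arm is required");
  return best;
}

// Posterior update after observing the user's outcome.
function updateArm(arm: Arm, converted: boolean): Arm {
  return converted
    ? { ...arm, alpha: arm.alpha + 1 }
    : { ...arm, beta: arm.beta + 1 };
}
```

Because arms with few observations have wide posteriors, their draws occasionally come out highest, so exploration happens without a separate exploration-rate parameter.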
Early Stopping
Stop experiment when:
(samples_per_variant > min_sample_size
AND (max_variant P(beats control) > 0.95 → Winner found
OR max_variant P(beats control) < 0.05 → No winner))
OR days_running > max_duration → Time bound reached (safety limit)
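A possible reading of these rules as code is sketched below. It assumes the minimum sample size gates only the probability-based criteria while the time bound applies unconditionally, and it treats the minimum as inclusive; both are interpretations rather than confirmed behaviour. `evaluateStopping`, `VariantSnapshot`, and `StopDecision` are hypothetical names, and `winnerThreshold` follows the guardrails field in being expressed as a percentage.

```typescript
interface VariantSnapshot {
  variantId: string;
  participants: number;
  probabilityBeatsControl: number; // 0-1
}

interface StoppingGuardrails {
  minSampleSizePerVariant: number;
  maxDurationDays: number;
  winnerThreshold: number; // percent, e.g. 95
}

type StopDecision =
  | { stop: false }
  | { stop: true; reason: "winner"; variantId: string }
  | { stop: true; reason: "no_winner" | "max_duration" };

function evaluateStopping(
  variants: VariantSnapshot[], // non-control variants only
  controlParticipants: number,
  daysRunning: number,
  guardrails: StoppingGuardrails,
): StopDecision {
  const threshold = guardrails.winnerThreshold / 100;
  // Guardrail: every arm, including control, needs the minimum sample.
  const enoughData =
    controlParticipants >= guardrails.minSampleSizePerVariant &&
    variants.every((v) => v.participants >= guardrails.minSampleSizePerVariant);

  if (enoughData && variants.length > 0) {
    const best = variants.reduce((a, b) =>
      b.probabilityBeatsControl > a.probabilityBeatsControl ? b : a,
    );
    if (best.probabilityBeatsControl > threshold) {
      return { stop: true, reason: "winner", variantId: best.variantId };
    }
    // Even the best variant is very unlikely to beat control.
    if (best.probabilityBeatsControl < 1 - threshold) {
      return { stop: true, reason: "no_winner" };
    }
  }
  // Safety limit applies regardless of sample size.
  if (daysRunning > guardrails.maxDurationDays) {
    return { stop: true, reason: "max_duration" };
  }
  return { stop: false };
}
```

Checking the evidence-based criteria before the time bound means an experiment that hits its maximum duration with a clear winner is still reported as a win rather than as a timeout.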
Appendix C: API Reference
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | /ab-testing/experiments | Admin | Create experiment |
| GET | /ab-testing/experiments | Admin | List experiments |
| GET | /ab-testing/experiments/:id | Admin | Get experiment details |
| PATCH | /ab-testing/experiments/:id | Admin | Update experiment |
| DELETE | /ab-testing/experiments/:id | Admin | Stop/archive experiment |
| POST | /ab-testing/experiments/:id/start | Admin | Start experiment |
| POST | /ab-testing/experiments/:id/pause | Admin | Pause experiment |
| POST | /ab-testing/experiments/:id/complete | Admin | Complete with winner |
| POST | /ab-testing/assign | Any auth | Get variant assignment for user |
| POST | /ab-testing/events | Any auth | Track experiment event |
| GET | /ab-testing/experiments/:id/results | Admin | Get statistical results |
| GET | /ab-testing/suggestions | Admin | AI-generated experiment ideas |
| POST | /ab-testing/hypotheses | Admin | Generate hypothesis from pattern |
Appendix D: Integration Points
With Feature Flags Module
- Experiments build on feature flag infrastructure
- Flag state = variant assignment
- Consistent bucketing with existing flags
With Telemetry Module
- Experiment events enriched with telemetry context
- Automatic metric tracking from existing events
- Funnel analysis using telemetry breadcrumbs
With Event Bus
| Event | Action |
|---|---|
| ab.experiment.started | Notify stakeholders, log audit |
| ab.experiment.completed | Generate report, suggest follow-ups |
| ab.variant.declared_winner | Trigger auto-rollout if enabled |
| ab.early_stopping.triggered | Alert experiment owner |
Appendix E: Cost Estimation
| Component | Monthly Cost (est.) |
|---|---|
| Cosmos DB (experiment data) | $100–200 |
| LLM hypothesis generation | $50–100 (weekly reports) |
| Compute (statistical engine) | $50 (negligible) |
| Total | $200–350/month |
Current Status
- Design complete — 2026-03-03
- Phase 1: Core Engine — Complete
- Phase 2: Statistics — Complete
- Phase 3: AI Hypotheses — Complete
- Phase 4: Admin UI — Complete
- Phase 5: Advanced — Future
Estimated Timeline: COMPLETE (Phases 1–4)
Dependencies:
- Feature flags module (for assignment infrastructure)
- Telemetry module (for event tracking)
- Azure OpenAI (for hypothesis generation)
Last Updated: 2026-03-03