A/B Testing

How A/B Testing Works

A/B testing (also called split testing) randomly divides users into two groups: the control group sees the existing version (A), while the variant group sees the modified version (B). Performance is measured against a predefined success metric — conversion rate, click-through rate, revenue per visitor, or similar.

The A/B Testing Process

Form a hypothesis: "Changing the CTA button color from blue to orange will increase conversion rate"
Define success metrics: Primary metric (conversion rate) plus guardrail metrics (bounce rate, time on page)
Calculate required sample size: Based on baseline conversion rate, minimum detectable effect, and desired statistical power
Run the test: Randomly assign traffic, typically for at least one full week to control for day-of-week patterns
Analyze results: Check for statistical significance (typically p 0.05) before declaring a winner

A/B Testing vs Conjoint Analysis

A/B testing measures real behavior with two specific variants in a live environment — high external validity but limited to testing one change at a time. Conjoint analysis measures stated preferences across many attribute combinations simultaneously in a survey environment — broader insight but stated rather than revealed preference.

Common A/B Testing Mistakes

Stopping tests too early: Checking results daily and stopping as soon as significance appears inflates false positive rates ("peeking problem")
Testing too many variables at once: Makes it impossible to know which change drove the result
Insufficient sample size: Underpowered tests produce inconclusive or misleading results
Ignoring seasonality: Running a test during an atypical period (holidays, promotions) skews results

Frequently Asked Questions

What is statistical significance in A/B testing?

Statistical significance (typically p 0.05, or 95% confidence) means there is less than a 5% probability that the observed difference between A and B occurred by random chance alone.

How long should an A/B test run?

Minimum one full week to capture day-of-week variation, often 2-4 weeks to reach adequate sample size and account for any novelty effects wearing off.