How A/B Testing Works
A/B testing (also called split testing) randomly divides users into two groups: the control group sees the existing version (A), while the variant group sees the modified version (B). Performance is measured against a predefined success metric — conversion rate, click-through rate, revenue per visitor, or similar.
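The random split is often implemented by hashing a stable user identifier, so that a returning user always sees the same version. The following is a minimal sketch of that idea, not a description of any particular tool. The function name `assign_variant`, the experiment label `cta_color_test`, and the 50/50 split are all assumptions made for illustration.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "cta_color_test") -> str:
    """Deterministically assign a user to 'A' (control) or 'B' (variant).

    Hypothetical helper: the experiment name and the 50/50 split are
    illustrative assumptions, not taken from a specific platform.
    """
    # Hash the experiment name together with the user ID so the same user
    # always lands in the same group, and separate experiments split
    # independently of each other.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) % 100  # integer bucket in the range 0..99
    return "A" if bucket < 50 else "B"


if __name__ == "__main__":
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, assign_variant(uid))
```

Hashing is pseudo-random rather than truly random, but it keeps assignment stable across sessions without storing a lookup table.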
The A/B Testing Process
- Form a hypothesis: "Changing the CTA button color from blue to orange will increase conversion rate"
- Define success metrics: Primary metric (conversion rate) plus guardrail metrics (bounce rate, time on page)
- Calculate required sample size: Based on baseline conversion rate, minimum detectable effect, and desired statistical power
- Run the test: Randomly assign traffic, typically for at least one full week to control for day-of-week patterns
- Analyze results: Check for statistical significance (typically p < 0.05) before declaring a winner
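The sample size step in the process above can be sketched with a standard normal-approximation formula for comparing two proportions. This is one common approximation, not the only valid method, and it assumes a two-sided test, equal group sizes, and a minimum detectable effect expressed as an absolute difference in conversion rate. The function name and the default values for `alpha` and `power` are illustrative assumptions.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline: float, mde_abs: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in EACH group for a two-proportion test.

    baseline: current conversion rate, e.g. 0.10 for 10%
    mde_abs:  minimum detectable effect as an absolute lift, e.g. 0.02
    alpha:    two-sided significance level (assumed default 0.05)
    power:    desired statistical power (assumed default 0.80)
    """
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of Bernoulli variances
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Example inputs: 10% baseline, aiming to detect a lift to 12%.
    print(sample_size_per_group(baseline=0.10, mde_abs=0.02))
```

With these example inputs the formula asks for roughly 3,800 users per group. Halving the minimum detectable effect roughly quadruples the requirement, which is why small expected lifts need long tests.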
A/B Testing vs Conjoint Analysis
A/B testing measures real behavior with two specific variants in a live environment. This gives high external validity, but each test evaluates only one change at a time. Conjoint analysis measures stated preferences across many attribute combinations simultaneously in a survey environment. It offers broader insight, but captures stated rather than revealed preferences.
Common A/B Testing Mistakes
- Stopping tests too early: Checking results daily and stopping as soon as significance appears inflates false positive rates ("peeking problem")
- Testing too many variables at once: Makes it impossible to know which change drove the result
- Insufficient sample size: Underpowered tests produce inconclusive or misleading results
- Ignoring seasonality: Running a test during an atypical period (holidays, promotions) skews results
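The peeking problem in the list above can be illustrated with a small simulation. The sketch below runs many A/A tests, where both groups share the same true conversion rate, so every "significant" result is a false positive. It compares checking every day and stopping at the first significant result against analyzing once at the end. All parameter values (number of tests, days, daily users, conversion rate) are arbitrary assumptions chosen to keep the run short.

```python
import math
import random
from statistics import NormalDist


def is_significant(conv_a: int, conv_b: int, n: int, z_crit: float) -> bool:
    """Pooled two-proportion z-test for two groups of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return False  # no variation observed, nothing to test
    return abs(conv_b - conv_a) / n / se > z_crit


def simulate_peeking(n_tests: int = 400, days: int = 14,
                     users_per_day: int = 200, rate: float = 0.10,
                     alpha: float = 0.05, seed: int = 0):
    """Return (false positive rate with daily peeking, rate with one final look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    final_hits = 0
    for _test in range(n_tests):
        conv_a = conv_b = n = 0
        flagged_on_any_day = False
        significant_today = False
        for _day in range(days):
            # Both groups draw from the SAME true rate (an A/A test).
            conv_a += sum(rng.random() < rate for _u in range(users_per_day))
            conv_b += sum(rng.random() < rate for _u in range(users_per_day))
            n += users_per_day
            significant_today = is_significant(conv_a, conv_b, n, z_crit)
            flagged_on_any_day = flagged_on_any_day or significant_today
        peeking_hits += flagged_on_any_day   # would have stopped on some day
        final_hits += significant_today      # significant at the planned end
    return peeking_hits / n_tests, final_hits / n_tests


if __name__ == "__main__":
    peeking_rate, final_rate = simulate_peeking()
    print(f"false positives with daily peeking: {peeking_rate:.1%}")
    print(f"false positives with one final look: {final_rate:.1%}")
```

The single final look should land near the nominal 5% level, while the daily-peeking rate is typically several times higher, since each additional look is another chance for noise to cross the threshold.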
Frequently Asked Questions
What is statistical significance in A/B testing?
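Statistical significance (typically p < 0.05, often described as 95% confidence) means that, if there were no real difference between A and B, a difference at least as large as the one observed would occur less than 5% of the time by random chance alone. It does not mean there is a 95% probability that B is truly better.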
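For conversion rates, one common way to compute the p-value is a pooled two-proportion z-test. The sketch below assumes large samples, so that the normal approximation holds, and a two-sided test. The function name and the example counts are made up for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z statistic, two-sided p-value) for conversion counts.

    conv_a / n_a: conversions and visitors in the control group (A)
    conv_b / n_b: conversions and visitors in the variant group (B)
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both groups share one conversion rate,
    # estimated by pooling the two groups.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical data: 400/4000 conversions for A, 460/4000 for B.
    z, p = two_proportion_z_test(conv_a=400, n_a=4000, conv_b=460, n_b=4000)
    print(f"z = {z:.2f}, p = {p:.4f}")
    print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

With these hypothetical counts (10.0% versus 11.5%), the p-value comes out around 0.03, below the usual 0.05 threshold.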
How long should an A/B test run?
Run the test for at least one full week to capture day-of-week variation. Many tests need 2-4 weeks to reach an adequate sample size and to let any novelty effect wear off.
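A rough duration estimate follows from the required sample size and the daily traffic entering the test. The sketch below is simple arithmetic under stated assumptions: traffic is steady, it is split evenly across groups, and the result is rounded up to whole weeks so every weekday is represented equally. The whole-week rounding and the function name are illustrative choices, not a fixed rule.

```python
import math


def estimate_duration_days(required_per_group: int, daily_visitors: int,
                           n_groups: int = 2, min_days: int = 7) -> int:
    """Estimate how many days a test should run.

    required_per_group: sample size needed in each group
    daily_visitors:     total visitors entering the test per day (assumed steady)
    n_groups:           number of groups sharing the traffic (2 for A/B)
    min_days:           floor on duration, one full week by default
    """
    total_needed = required_per_group * n_groups
    days_for_sample = math.ceil(total_needed / daily_visitors)
    days = max(days_for_sample, min_days)
    # Round up to whole weeks (an assumption made in this sketch).
    return math.ceil(days / 7) * 7


if __name__ == "__main__":
    # Hypothetical inputs: 3,839 users per group, 600 visitors per day.
    print(estimate_duration_days(required_per_group=3839, daily_visitors=600))
```

With these hypothetical inputs the sample size is reached after 13 days, which rounds up to a 14-day test.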