A/B test

An A/B test is an experiment that compares two versions of something (a webpage, feature, email, or any other variable) to determine which performs better. Users are randomly assigned to see either version A (the control) or version B (the variant), and their behavior is measured against a predefined metric. The version that produces better results wins.

Why it matters

Product decisions often come down to opinions. The designer prefers one approach, the PM prefers another, and the founder has a third idea. Without data, the highest-paid person's opinion wins, which isn't a reliable way to build products.

A/B testing replaces opinion with evidence. Instead of debating which headline will convert better, you test both and let user behavior decide. This approach compounds over time: teams that consistently test and learn build better products than teams that rely on intuition alone.

How it works

The mechanics are straightforward. You create two versions of whatever you're testing. Version A is typically the current state (the control), and version B is the change you want to evaluate (the variant). Users are randomly assigned to one version or the other, ensuring the groups are comparable.

You measure a specific outcome: conversion rate, click-through rate, time on page, revenue per user, or whatever metric matters for your hypothesis. After enough data accumulates, you analyze whether the difference between versions is statistically significant or just random variation.

If version B performs better and the difference is statistically significant, you roll it out to everyone. If not, you keep version A or try a different variant.
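The analysis step can be sketched with a two-proportion z-test, one common way to compare conversion rates. This is a minimal illustration under assumptions: the function name and the example counts (500 of 10,000 versus 580 of 10,000) are made up, and real experimentation tools may use a different test.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_* are conversion counts and n_* are users per version.
    Uses the pooled standard error, which assumes no true difference
    under the null hypothesis.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts, chosen only for illustration.
z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=580, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
print("roll out B" if p < 0.05 and z > 0 else "keep A or try another variant")
```

The 0.05 threshold here mirrors the conventional 95% confidence level; it is a convention, not a requirement.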

Running a good test

Start with a clear hypothesis. "I think changing the button color will increase conversions" is weak. "Changing the CTA from 'Sign Up' to 'Start Free Trial' will increase signup conversion by 10% because it better communicates the risk-free nature of trying the product" is better. A good hypothesis explains what you're changing, what you expect to happen, and why.

Choose the right metric. Your primary metric should directly connect to the hypothesis. Secondary metrics help you understand side effects: did improving one thing hurt another?

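Calculate sample size before starting. How many users do you need to detect a meaningful difference? The answer depends on your baseline rate, the smallest effect you care about detecting, and the significance level and statistical power you want. Running a test for too short a time produces unreliable results. Running it too long wastes time. Sample size calculators turn these inputs into a required number of users, which your traffic then converts into a test duration.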
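As a rough sketch of what such a calculator does, the following uses the standard normal-approximation formula for comparing two proportions. The function names, the 5% baseline, the 10% relative lift, and the 2,000 users per day are illustrative assumptions, not recommendations.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided test.

    baseline: current conversion rate, e.g. 0.05
    relative_lift: smallest relative change worth detecting, e.g. 0.10
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def days_to_run(users_per_variant, daily_users, variants=2):
    """Days of traffic needed if all daily users enter the test."""
    return math.ceil(users_per_variant * variants / daily_users)


n = sample_size_per_variant(baseline=0.05, relative_lift=0.10)
print(f"users per variant: {n}")
print(f"days at 2,000 users/day: {days_to_run(n, daily_users=2_000)}")
```

Small baseline rates and small lifts push the required sample up quickly, which is why the duration is worth computing before the test starts.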

Randomize properly. Users should have an equal chance of seeing either version, and the assignment should be consistent: the same user shouldn't flip between versions.
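One common way to get assignment that is both random-looking and consistent is to hash a stable user identifier together with the experiment name. The sketch below assumes string user IDs and a 50/50 split; `assign_variant` and the experiment name are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically map a user to "A" or "B" for one experiment.

    Including the experiment name in the hash keeps assignments
    independent across experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    return "A" if bucket < 50 else "B"


# The same user always lands in the same version.
print(assign_variant("user-42", "cta-copy"))
print(assign_variant("user-42", "cta-copy"))
```

Because the result depends only on the inputs, no assignment table needs to be stored, and a returning user sees the same version on every visit.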

Run the test to completion. Don't peek at results and stop early when they look good. This introduces bias and inflates false positive rates. Decide on duration upfront and stick to it.

Statistical significance

Statistical significance indicates whether the observed difference is likely real or just random noise. A result is typically considered significant at the 95% confidence level (a p-value below 0.05), meaning that if the two versions truly performed the same, a difference at least this large would show up by chance less than 5% of the time.

Significance doesn't mean the difference is large or important, just that it's probably real. A 0.1% improvement can be statistically significant with enough data but may not be worth the engineering effort to implement.
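A confidence interval for the difference makes the gap between "significant" and "worth shipping" concrete. The sketch below uses a simple Wald interval; the counts and the 0.5 percentage point threshold for a worthwhile change are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for (rate of B) minus (rate of A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Hypothetical: 10.0% vs 10.1% conversion with 2 million users per version.
low, high = diff_confidence_interval(200_000, 2_000_000, 202_000, 2_000_000)
minimum_worthwhile = 0.005  # assumed business threshold, in absolute terms

print(f"95% CI for the difference: [{low:.5f}, {high:.5f}]")
print("statistically significant:", low > 0 or high < 0)
print("large enough to matter:", low >= minimum_worthwhile)
```

An interval that excludes zero but sits entirely below the threshold describes exactly the case above: a real effect that is too small to justify the work.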

Conversely, lack of significance doesn't mean the versions are identical. It means you couldn't detect a difference with your sample size. The true difference might be smaller than you can reliably measure.

Common mistakes

Testing too many things at once makes it impossible to know what caused the change. If you change the headline, image, and button simultaneously, you won't know which mattered. Test one variable at a time, or use multivariate testing with appropriate sample sizes.

Stopping tests early when results look promising inflates false positives. If you peek at results daily and stop when significance is reached, you'll "discover" many effects that aren't real.
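A small simulation can show the size of the problem. The sketch below runs A/A tests, where both versions share the same true conversion rate, so every "significant" result is a false positive. The traffic numbers, the 14-day duration, and the seed are arbitrary assumptions.

```python
import math
import random
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(n_tests=400, days=14, users_per_day=200, rate=0.10, seed=7):
    """Return (false positive rate with daily peeking, rate with one final look)."""
    rng = random.Random(seed)
    flagged_when_peeking = 0
    flagged_at_end = 0
    for _ in range(n_tests):
        conv_a = conv_b = users = 0
        any_peek_significant = False
        for _day in range(days):
            conv_a += sum(rng.random() < rate for _ in range(users_per_day))
            conv_b += sum(rng.random() < rate for _ in range(users_per_day))
            users += users_per_day
            # A peeker would stop the test the first time this is true.
            if p_value(conv_a, users, conv_b, users) < 0.05:
                any_peek_significant = True
        flagged_when_peeking += any_peek_significant
        flagged_at_end += p_value(conv_a, users, conv_b, users) < 0.05
    return flagged_when_peeking / n_tests, flagged_at_end / n_tests


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"false positives with daily peeking: {peeking_rate:.1%}")
print(f"false positives with a single final look: {fixed_rate:.1%}")
```

The single final look should land near the nominal 5%, while checking every day and stopping at the first significant reading flags noticeably more tests.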

Ignoring segment effects misses important nuance. The overall result might be neutral while version B dramatically helps one user segment and dramatically hurts another. Look at results by meaningful segments.
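The sketch below shows how a flat overall result can hide opposite effects. The segments and counts are invented so that the two effects cancel exactly.

```python
# Hypothetical results: (conversions, users) per version within each segment.
results = {
    "new users": {"A": (100, 2000), "B": (140, 2000)},
    "returning users": {"A": (200, 2000), "B": (160, 2000)},
}


def rate(conversions, users):
    return conversions / users


def lift_by_segment(results):
    """Absolute difference in conversion rate (B minus A) for each segment."""
    return {
        segment: rate(*arms["B"]) - rate(*arms["A"])
        for segment, arms in results.items()
    }


def overall_lift(results):
    """Absolute difference in conversion rate across all segments combined."""
    totals = {"A": [0, 0], "B": [0, 0]}
    for arms in results.values():
        for arm, (conversions, users) in arms.items():
            totals[arm][0] += conversions
            totals[arm][1] += users
    return rate(*totals["B"]) - rate(*totals["A"])


print(f"overall lift: {overall_lift(results):+.3f}")
for segment, lift in lift_by_segment(results).items():
    print(f"{segment}: {lift:+.3f}")
```

Each extra segment is another comparison, so slicing many ways raises the chance of a spurious finding. It helps to choose the segments before the test starts.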

Testing trivial changes wastes resources. Button color changes rarely move metrics meaningfully. Focus tests on changes with real potential impact: value propositions, user flows, core features.

Not having enough traffic makes testing impractical. If you need 10,000 users per variant and you get 1,000 users per month, filling two variants would take 20 months, so A/B testing isn't the right approach. Use qualitative research instead.

What to test

High-impact areas for testing include:

Headlines and copy often have surprising effects. How you describe your product can dramatically affect conversion.

Calls to action, including button text, placement, and design. Small changes here can compound across the funnel.

Pricing and packaging, though these tests require careful consideration of revenue impact.

Onboarding flows, where small friction reductions can significantly improve activation.

Feature designs, when you're uncertain which approach better serves users.

Beyond simple A/B tests

Multivariate testing varies multiple elements simultaneously, showing which combinations work best. It requires much more traffic, because every combination needs its own sample, but provides richer insights.
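The traffic cost comes from the number of combinations, which grows multiplicatively. A small sketch, assuming a full-factorial design and made-up element names and options:

```python
from itertools import product

# Hypothetical page elements and the options to test for each.
elements = {
    "headline": ["current", "benefit-led", "question"],
    "image": ["product shot", "customer photo"],
    "button": ["Sign Up", "Start Free Trial"],
}


def all_combinations(elements):
    """Every combination of options, one dict per page variant."""
    names = list(elements)
    return [dict(zip(names, values)) for values in product(*elements.values())]


def total_users_needed(elements, users_per_combination):
    """Total traffic if each combination needs its own full sample."""
    return len(all_combinations(elements)) * users_per_combination


variants = all_combinations(elements)
print(f"{len(variants)} combinations")
print(f"{total_users_needed(elements, 10_000):,} users at 10,000 per combination")
```

Adding a single extra option to any element multiplies the number of page variants, which is why these tests suit high-traffic pages.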

Bandit algorithms dynamically allocate more traffic to better-performing variants, reducing the cost of testing. They trade off statistical rigor for practical optimization.
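Thompson sampling is one widely used bandit strategy. The sketch below simulates it for two variants with assumed true conversion rates of 5% and 10%. The rates, round count, and seed are illustrative only, and a production system would update from real user outcomes instead of simulated ones.

```python
import random


def thompson_sampling(true_rates, rounds=5000, seed=11):
    """Simulate Thompson sampling and return how often each variant was shown.

    Each variant keeps a Beta(successes + 1, failures + 1) belief about its
    conversion rate. Every round, one rate is sampled from each belief and
    the variant with the highest sample is shown.
    """
    rng = random.Random(seed)
    successes = {name: 0 for name in true_rates}
    failures = {name: 0 for name in true_rates}
    shown = {name: 0 for name in true_rates}
    for _ in range(rounds):
        sampled = {
            name: rng.betavariate(successes[name] + 1, failures[name] + 1)
            for name in true_rates
        }
        chosen = max(sampled, key=sampled.get)
        shown[chosen] += 1
        # Simulated user response, driven by the assumed true rate.
        if rng.random() < true_rates[chosen]:
            successes[chosen] += 1
        else:
            failures[chosen] += 1
    return shown


shown = thompson_sampling({"A": 0.05, "B": 0.10})
print(shown)
```

Because the weaker variant ends up with little traffic, its conversion rate is estimated less precisely than in a fixed 50/50 test. That is the rigor being traded away.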

Feature flags enable testing of features with specific user segments before full rollout, combining testing with gradual deployment.
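A minimal sketch of a flag that combines segment targeting with a percentage rollout follows. The `FeatureFlag` class, its fields, and the segment names are hypothetical and do not reflect any particular feature flag product.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    name: str
    rollout_percent: int  # 0 to 100
    segments: set = field(default_factory=set)  # empty set means all segments

    def is_enabled(self, user_id: str, segment: str) -> bool:
        """True if the user's segment is targeted and their bucket is in range."""
        if self.segments and segment not in self.segments:
            return False
        key = f"{self.name}:{user_id}".encode("utf-8")
        bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
        return bucket < self.rollout_percent


flag = FeatureFlag("new-onboarding", rollout_percent=20, segments={"beta"})
print(flag.is_enabled("user-1", "beta"))
print(flag.is_enabled("user-1", "free"))  # always False: segment not targeted
```

Since each user's bucket is fixed, raising `rollout_percent` only adds users. Nobody who already has the feature loses it as the rollout widens.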

Building a testing culture

A/B testing works best as a habit, not an occasional activity. Teams that test consistently develop better intuition over time: they learn what kinds of changes matter and what's unlikely to move metrics.

Document test results, including failures. The tests that don't work teach as much as the ones that do. A repository of past tests prevents repeating experiments and accumulates organizational knowledge.

Klero helps connect A/B test results to customer feedback. When you can see not just that version B performed better but what users said about each version, you understand why the difference occurred, which informs future tests.
