Over the past few years, I have observed many common errors people make when designing A/B tests and performing post-analysis. In this article, I want to highlight three of these mistakes and explain how they can be avoided.
Using Mann–Whitney to compare medians
The first mistake is the incorrect use of the Mann–Whitney test. This method is widely misunderstood and frequently misused, as many people treat it as a non-parametric “t-test” for medians. In fact, the Mann–Whitney test is designed to determine whether there is a shift between two distributions.
When applying the Mann–Whitney test, the hypotheses are defined as follows:
- H0: the two distributions are identical (equivalently, P(X > Y) = P(Y > X))
- H1: one distribution is shifted relative to the other
We must always consider the assumptions of the test. There are only two:
- Observations are i.i.d.
- The distributions have the same shape
How to compute the Mann–Whitney statistic:
- Sort all observations by magnitude.
- Assign ranks to all observations.
- Compute the U statistics for both samples: U1 = R1 − n1(n1 + 1)/2 and U2 = n1·n2 − U1, where R1 is the rank sum of the first sample and n1, n2 are the two sample sizes.
- Choose the minimum of these two values.
- Use statistical tables for the Mann–Whitney U test to find the probability of observing this value of U or lower.
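The ranking steps above can be sketched in a few lines of Python. The function name here is my own; in practice, `scipy.stats.mannwhitneyu` computes the same statistic directly:

```python
import numpy as np
from scipy.stats import rankdata, mannwhitneyu

def mann_whitney_u(x, y):
    """U statistic via the ranking steps above (minimum of U1 and U2)."""
    pooled = np.concatenate([x, y])
    ranks = rankdata(pooled)        # ties receive the average rank
    n1, n2 = len(x), len(y)
    r1 = ranks[:n1].sum()           # rank sum of the first sample
    u1 = r1 - n1 * (n1 + 1) / 2     # U statistic for sample 1
    u2 = n1 * n2 - u1               # U1 + U2 always equals n1 * n2
    return min(u1, u2)

x = [1.1, 2.3, 2.9, 4.0]
y = [0.8, 1.7, 3.5, 5.2, 6.1]
print(mann_whitney_u(x, y))  # 8.0
```

Note that the statistic is built entirely from ranks, which is exactly why it reacts to a shift between distributions rather than to the medians themselves.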
Since we now know that this test should not be used to compare medians, what should we use instead?
Fortunately, in 1945 the statistician Frank Wilcoxon introduced the signed-rank test, now known as the Wilcoxon Signed Rank Test.
The hypotheses for this test match what we originally expected:
- H0: the median of the paired differences is zero
- H1: the median of the paired differences is not zero
How to calculate the Wilcoxon Signed Rank test statistic:
- For each paired observation, calculate the difference, keeping both its absolute value and sign.
- Sort the absolute differences from smallest to largest and assign ranks.
- Compute the test statistic: W = Σ sign(dᵢ) · Rᵢ, where dᵢ is the i-th paired difference and Rᵢ is the rank of |dᵢ|.
- The statistic W follows a known distribution. When n is larger than roughly 20, it is approximately normally distributed. This allows us to compute the probability of observing W under the null hypothesis and determine statistical significance.
Some intuition behind the formula:
If the median difference equals zero, about half of the signs should be positive and half negative, with no relationship between signs and ranks, so the signed ranks largely cancel out. If the median difference is not zero, W will tend to be large in absolute value.
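A minimal sketch of these steps (the function name and sample data are mine; `scipy.stats.wilcoxon` implements the equivalent test, reporting the one-sided rank sum instead of the signed sum):

```python
import numpy as np
from scipy.stats import rankdata

def signed_rank_w(before, after):
    """W = sum of signed ranks of the paired differences, per the steps above."""
    d = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    d = d[d != 0]                        # zero differences carry no sign
    ranks = rankdata(np.abs(d))          # rank the absolute differences
    return float(np.sum(np.sign(d) * ranks))

before = [10, 12, 9, 14, 8]
after = [12, 11, 13, 15, 10]
print(signed_rank_w(before, after))  # 12.0
```

Swapping the two samples flips every sign, so the statistic is antisymmetric in its arguments, and under H0 it hovers near zero.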
Using bootstrapping everywhere and for every dataset
The second mistake is applying bootstrapping all the time. I’ve often seen people bootstrap every dataset without first verifying whether bootstrapping is appropriate in that context.
The key assumption behind bootstrapping is:
==The sample must be representative of the population from which it was drawn.==
If the sample is biased and poorly represents the population, the bootstrapped statistics will also be biased. That’s why it’s crucial to examine proportions across different cohorts and segments.
For example, if your sample contains only women, while your overall customer base has an equal gender split, bootstrapping is not appropriate.
A good practice is to compare the main segments in your dataset with those in the full population.
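When the sample does pass that representativeness check, a percentile bootstrap is straightforward. Here is a minimal sketch (the function name, seed, and parameters are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(sample, stat=np.median, n_boot=5_000, alpha=0.05):
    """Percentile-bootstrap CI; only valid if `sample` represents the population."""
    sample = np.asarray(sample)
    estimates = np.array([
        stat(rng.choice(sample, size=len(sample), replace=True))  # resample with replacement
        for _ in range(n_boot)
    ])
    return np.quantile(estimates, [alpha / 2, 1 - alpha / 2])

sample = rng.normal(loc=100.0, scale=10.0, size=500)
lo, hi = bootstrap_ci(sample)
```

If the sample is biased (say, women only against a 50/50 customer base), every one of those resamples inherits the bias, and the interval is precise but wrong.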
Always using default Type I and Type II error values
Last but not least is the habit of blindly using default experiment parameters. In roughly 95% of cases, analysts and data scientists simply stick with the defaults: a 5% Type I error rate and a 20% Type II error rate (i.e., 80% test power).
Let’s start with a natural question: why not just set both Type I and Type II error rates to 0%?
==Because doing so would require an infinite sample size, meaning the experiment would never end.==
Clearly, that’s not practical. We must strike a balance between the number of samples we can collect and acceptable error rates.
I encourage people to consider all relevant product constraints.
The most convenient way to do this is to build a table like the one below and discuss it with the product managers and other people responsible for the product.

For a company like Netflix, even a 1% MDE can translate into substantial profit; for a small startup, the same uplift may be negligible. Google, on the other hand, can easily run experiments involving tens of millions of users, making it reasonable to set the Type I error rate as low as 0.1% to gain higher confidence in the results.
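To make that discussion concrete, the required sample size follows directly from the chosen error rates. A sketch using the standard normal-approximation formula for a two-sided, two-sample comparison of means (the function name is mine):

```python
import math
from scipy.stats import norm

def n_per_group(mde, sd, alpha=0.05, power=0.80):
    """Per-group sample size to detect a difference of `mde` in means."""
    z_alpha = norm.ppf(1 - alpha / 2)   # quantile for the Type I error rate
    z_beta = norm.ppf(power)            # quantile for the test power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * (sd / mde) ** 2)

print(n_per_group(mde=0.5, sd=1.0))               # defaults (5% / 80%): 63 per group
print(n_per_group(mde=0.5, sd=1.0, alpha=0.001))  # Google-style strict alpha: 137
```

Filling a table with a few (alpha, power, MDE) combinations from this formula is usually enough to show product stakeholders the real cost of each choice.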
