The Hidden Flaws In Your A/B Testing Strategy Nobody Talks About

Table of Links

Introduction
Hypothesis testing

2.1 Introduction

2.2 Bayesian statistics

2.3 Test martingales

2.4 p-values

2.5 Optional Stopping and Peeking

2.6 Combining p-values and Optional Continuation

2.7 A/B testing
Safe Tests

3.1 Introduction

3.2 Classical t-test

3.3 Safe t-test

3.4 χ²-test

3.5 Safe Proportion Test
Safe Testing Simulations

4.1 Introduction and 4.2 Python Implementation

4.3 Comparing the t-test with the Safe t-test

4.4 Comparing the χ²-test with the safe proportion test
Mixture sequential probability ratio test

5.1 Sequential Testing

5.2 Mixture SPRT

5.3 mSPRT and the safe t-test
Online Controlled Experiments

6.1 Safe t-test on OCE datasets
Vinted A/B tests and 7.1 Safe t-test for Vinted A/B tests

7.2 Safe proportion test for sample ratio mismatch
Conclusion and References

2.6 Combining p-values and Optional Continuation

Combining p-values has been a subject of debate since their origins with Pearson and Fisher [HR18]. Combination methods are often applied in meta-analyses of multiple experiments. Various methods exist for different contexts, and it is not always clear which one should be used in a given situation. Safe testing provides a simple, intuitive way to combine the results of many experiments.

Figure 1: False positive probability for the classical t-test for α = 0.01, 0.5, 0.1 .

In the section on peeking, it was mentioned that experimenters may want to make a decision about an experiment based on an intermediate observed effect size. With traditional statistical testing, such intermediate results are not statistically valid, and hence correct conclusions cannot be drawn from them. Safe testing, however, allows the experimenter to decide to continue a test if more data are needed to observe a significant effect.
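As a rough illustration of optional continuation, the sketch below assumes the simplest possible setting: Gaussian data with known unit variance, a point null hypothesis (mean 0) and a point alternative (mean `delta`). In that setting the likelihood ratio is an e-value, e-values from independent experiments can be multiplied, and the null is rejected once the product reaches 1/α. The function names and the choice `delta=0.5` are assumptions made for this example; the safe t-test introduced later in the paper uses a different construction.

```python
import math
import random


def likelihood_ratio_e_value(data, delta=0.5, sigma=1.0):
    """E-value for H0: mean = 0 against the point alternative mean = delta.

    Assumes independent Gaussian observations with known standard deviation.
    """
    # Log likelihood ratio of N(delta, sigma^2) to N(0, sigma^2), summed over data.
    log_e = sum((delta * x - delta ** 2 / 2) / sigma ** 2 for x in data)
    return math.exp(log_e)


def combine_e_values(e_values):
    """Combine e-values from independent experiments by multiplication."""
    return math.prod(e_values)


def reject_null(e_value, alpha=0.05):
    """Reject the null hypothesis when the e-value is at least 1 / alpha."""
    return e_value >= 1 / alpha


# Hypothetical usage: run a first batch, then continue only if it was inconclusive.
rng = random.Random(0)
e_values = [likelihood_ratio_e_value([rng.gauss(0.5, 1.0) for _ in range(20)])]
if not reject_null(combine_e_values(e_values)):
    # Optional continuation: collect a second batch and multiply the evidence.
    e_values.append(likelihood_ratio_e_value([rng.gauss(0.5, 1.0) for _ in range(20)]))
print("combined e-value:", combine_e_values(e_values))
print("reject H0:", reject_null(combine_e_values(e_values)))
```

The decision to run the second batch depends on the first result, which is exactly the kind of data-dependent choice that invalidates a fixed-sample p-value but is allowed when evidence is accumulated by multiplying e-values.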

2.7 A/B testing

A/B testing at first appears to be a simple application of statistical tests; however, there are nuances that are highly relevant to experimenters. A typical A/B test will have automated measurements of tens or possibly hundreds of metrics. Consider a test in which an experimenter wishes to measure a new feature’s impact on sales on their website. The target metric for this experiment may be total sales per user. In addition to testing the feature’s impact on total sales, they may wish to see more engagement from users who did not buy anything, because higher engagement with the platform can increase its value to users. Therefore, monitoring secondary metrics, such as the number of favourited items per user, the time spent on the platform, and the proportion of searches that lead to sales, may give additional information about the performance of the feature. There may, however, be unintended consequences of the feature. There may be a bug that causes the website to crash on certain browsers, or the feature may cannibalize sales of cheaper products by showing more expensive ones. It is therefore crucial to monitor so-called guardrail metrics to ensure that the feature is working as intended.

Aside from the metrics in the experiment, there are other factors to consider when evaluating results. Most statistical tests assume data are independent and identically distributed. However, a new feature may attract interest from curious users, leading to unreliable metrics. This is known as the novelty effect, and it may bias the results of a test. Another consideration is the time it takes for metrics to converge. Some metrics, such as the number of items viewed after a search, give instantaneous results. A metric such as the proportion of users who make a purchase may take several days to converge, because users may be exposed to a test while browsing the products and return several days later to make the purchase. This delay between exposure to a test and its realization can make some metrics unreliable in the short term.

A final challenge to large-scale A/B testing concerns the random assignment of users to variants. Each experiment has an associated probability for users to be assigned to either the control or the test group. The results of each user’s session are recorded in a database before being aggregated during metric calculations. Issues in this process can lead to group sizes that differ from those implied by the assignment probabilities. This is known as a sample ratio mismatch (SRM) and can indicate that the test results are biased, and therefore unreliable. It is therefore important for experimenters to continuously monitor the sample ratio of their A/B tests in order to stop erroneous experiments.
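A minimal sketch of an SRM check is given below. It uses a classical χ² goodness-of-fit test with one degree of freedom, not the safe proportion test that the paper develops for this purpose, so checking this p-value repeatedly during an experiment is subject to the peeking problem described earlier. The function names and the 0.001 alert threshold are assumptions made for this example.

```python
import math


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the observed split."""
    total = n_control + n_treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(n_control, n_treatment, expected_control_share=0.5, threshold=0.001):
    """Flag a sample ratio mismatch when the p-value falls below the threshold."""
    return srm_p_value(n_control, n_treatment, expected_control_share) < threshold


# Hypothetical counts for an experiment designed with a 50/50 split.
print(has_srm(5050, 4950))    # small imbalance, consistent with chance
print(has_srm(50000, 48000))  # large imbalance, worth investigating
```

A strict threshold is used here because an SRM alert usually means the experiment's data pipeline needs to be investigated, so false alarms are costly.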

Having discussed A/B testing and the inflexibility of traditional statistical testing, we now introduce safe testing and how it can be applied to solve these issues.

Author:

(1) Daniel Beasley

This paper is available on arXiv under the Attribution-NonCommercial-ShareAlike 4.0 International license.
