Table of Links
-
Introduction
-
Hypothesis testing
2.1 Introduction
2.2 Bayesian statistics
2.3 Test martingales
2.4 p-values
2.5 Optional Stopping and Peeking
2.6 Combining p-values and Optional Continuation
2.7 A/B testing
-
Safe Tests
3.1 Introduction
3.2 Classical t-test
3.3 Safe t-test
3.4 χ²-test
3.5 Safe Proportion Test
-
Safe Testing Simulations
4.1 Introduction and 4.2 Python Implementation
4.3 Comparing the t-test with the Safe t-test
4.4 Comparing the χ²-test with the safe proportion test
-
Mixture sequential probability ratio test
5.1 Sequential Testing
5.2 Mixture SPRT
5.3 mSPRT and the safe t-test
-
Online Controlled Experiments
6.1 Safe t-test on OCE datasets
-
Vinted A/B tests and 7.1 Safe t-test for Vinted A/B tests
7.2 Safe proportion test for sample ratio mismatch
-
Conclusion and References
5 Mixture sequential probability ratio test
5.1 Sequential Testing
As sophisticated A/B testing infrastructure has proliferated, so too have the opportunities to peek at test results [Joh+17]. As we have seen, this has the unintended consequence of inflating the false positive rate. To take full advantage of their infrastructure, then, large technology companies have begun adopting statistical methods that are valid at any time. This field of statistics is known as sequential testing, or anytime-valid inference. Sequential testing originated with Wald's seminal paper on the subject, Sequential Tests of Statistical Hypotheses [Wal45], which introduced the first sequential testing method: the sequential probability ratio test (SPRT). The SPRT is a one-sample test that, after m observations, divides the sample space into three mutually exclusive regions corresponding to the decision to be taken: accept H0, reject H0, or continue sampling. The quantity driving the decision is the probability of the data under H1 divided by the probability of the data under H0, P(D|H1)/P(D|H0). This is the well-known Bayes factor between the alternative and null hypotheses, and it is closely related to E-variables in safe testing [GHK23].
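To make Wald's three-region decision rule concrete, here is a minimal sketch of the SPRT for Bernoulli observations (the function name and the Bernoulli setting are illustrative choices, not from the paper), using Wald's standard boundary approximations A ≈ β/(1−α) and B ≈ (1−β)/α on the log likelihood ratio:

```python
import math

def sprt_bernoulli(data, p0, p1, alpha=0.05, beta=0.05):
    """Wald's SPRT for Bernoulli observations: H0: p = p0 vs H1: p = p1.

    Uses Wald's boundary approximations A ~ beta/(1-alpha) (accept H0)
    and B ~ (1-beta)/alpha (reject H0) on the log likelihood ratio.
    Returns a (decision, samples_used) pair.
    """
    log_a = math.log(beta / (1 - alpha))   # accept-H0 boundary
    log_b = math.log((1 - beta) / alpha)   # reject-H0 boundary
    llr = 0.0                              # cumulative log likelihood ratio
    for n, x in enumerate(data, start=1):
        # log P(x | H1) - log P(x | H0) for one Bernoulli observation
        llr += x * math.log(p1 / p0) + (1 - x) * math.log((1 - p1) / (1 - p0))
        if llr >= log_b:
            return "reject H0", n
        if llr <= log_a:
            return "accept H0", n
    return "continue", len(data)

# On a deterministic run of successes, H0: p = 0.5 is rejected quickly.
print(sprt_bernoulli([1] * 30, p0=0.5, p1=0.7))  # -> ('reject H0', 9)
```

Note that the test stops as soon as the ratio exits the continuation region, so the sample size is itself random, which is the defining feature of Wald's formulation.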
Wald and Wolfowitz proved that the SPRT is the optimal sequential test in terms of statistical power [WW48]. It should be noted, however, that their formulation of a sequential test is not aligned with that of safe tests. Their proof is based on dividing the probability ratio space into three regions: accept H0, reject H0, or continue sampling. The safe t-test, by contrast, is optimal in terms of GROW [Pér+22], meaning that the E-variable E grows fastest when H0 is not true. The decision to reject H0 is taken when E ≥ 1/α, while the opposing decision to accept H0 can be taken at any time. Understanding the differing formulations of these sequential tests and their optimality proofs should help to internalize the relative performances of the two tests.
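The safe-test decision rule described above, rejecting H0 the first time the running product of E-values reaches 1/α, can be sketched as follows (an illustrative fragment, not the paper's implementation; by Ville's inequality the type-I error of this rule is at most α):

```python
def grow_decision(e_values, alpha=0.05):
    """Anytime-valid rejection rule for a stream of E-values.

    Reject H0 the first time the running product reaches 1/alpha;
    Ville's inequality bounds the type-I error by alpha. Accepting H0
    (stopping for futility) is allowed at any time without harming
    this guarantee.
    """
    e = 1.0
    for n, e_n in enumerate(e_values, start=1):
        e *= e_n  # optional continuation: E-values multiply
        if e >= 1 / alpha:
            return "reject H0", n
    return "continue", len(e_values)

# A stream of E-values of 1.5 crosses 1/0.05 = 20 at the eighth round.
print(grow_decision([1.5] * 10))  # -> ('reject H0', 8)
```

Unlike the SPRT's two-sided boundary, there is only a rejection boundary here, which is exactly the asymmetry noted in the text.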
5.2 Mixture SPRT
Developing a sequential A/B test required extending the SPRT to two-sample data. This was accomplished by Johari et al. [Joh+17], who pioneered a method of A/B testing known as the mixture sequential probability ratio test (mSPRT). The test has since been adopted by large technology companies such as Uber and Netflix [SA23]. As with the safe t-test, the mSPRT performs best with granular, sequential data. The mSPRT is essentially the SPRT with the fixed alternative replaced by a mixture over parameter values, encoding a prior belief that the true parameter lies close to θ0. Let's examine the mathematical details of this test in more depth.
We will keep the mSPRT statistic in its martingale form in order to compare the performance with the safe t-test.
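As a sketch of that martingale form: under a normal mixing prior N(θ0, τ²) on the treatment effect with known observation variance σ², the two-sample mSPRT statistic of Johari et al. [Joh+17] admits the closed form below (the function name and parameterization are illustrative choices):

```python
import math

def msprt_two_sample(x_mean, y_mean, n, sigma2, tau2, theta0=0.0):
    """Two-sample mSPRT statistic under a N(theta0, tau2) mixing prior on
    the treatment effect, with known observation variance sigma2 [Joh+17].

    The statistic is a nonnegative martingale with mean 1 under H0, so
    rejecting H0 once it reaches 1/alpha is anytime-valid.
    """
    delta = y_mean - x_mean - theta0           # observed effect vs H0
    denom = 2 * sigma2 + n * tau2
    return math.sqrt(2 * sigma2 / denom) * math.exp(
        n * n * tau2 * delta * delta / (4 * sigma2 * denom)
    )

# With no observed effect the statistic stays below 1; a clear effect
# pushes it well past the 1/alpha = 20 rejection boundary.
no_effect = msprt_two_sample(0.0, 0.0, n=100, sigma2=1.0, tau2=1.0)
clear_effect = msprt_two_sample(0.0, 0.5, n=100, sigma2=1.0, tau2=1.0)
```

Keeping the statistic in this product/martingale form is what allows a direct, round-by-round comparison with the safe t-test's E-variable in the next section.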
Author:
(1) Daniel Beasley
This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 4.0 INTERNATIONAL license.