3 Safe Tests
3.1 Introduction
Safe testing [GHK23] is a novel method of hypothesis testing developed to address many issues with modern statistical inference. The "safe" in safe testing refers to the fact that the false positive rate does not rise above α in the optional continuation setting. As we will see, many safe tests also allow for optional stopping [GHK23], including the two we will apply: the safe t-test and the safe proportion test. Figure 2 shows how the false positive rate of the safe t-test changes over an experiment.
Safe testing is based on E-variables (or E-test statistics), which are non-negative random variables E that satisfy $\mathbb{E}_P[E] \le 1$ for all $P \in H_0$.
Under the null hypothesis, many E-variables behave as test martingales [GHK23], which are closely related to Bayes factors.
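To make the defining property concrete, the sketch below uses a toy example that is not taken from the paper: a Bernoulli likelihood ratio between an assumed alternative p1 and an assumed null p0. It checks that this ratio has expectation 1 under the null, then computes exactly (by dynamic programming, with no simulation) the null probability that the running product of such ratios ever reaches 1/α. All function names and parameter values are illustrative.

```python
def likelihood_ratio(x, p0, p1):
    """E-variable for one Bernoulli outcome x in {0, 1}: alternative p1 against null p0."""
    return p1 / p0 if x == 1 else (1 - p1) / (1 - p0)


def expectation_under_null(p0, p1):
    """Exact expected value of the likelihood ratio when the data follow the null p0."""
    return p0 * likelihood_ratio(1, p0, p1) + (1 - p0) * likelihood_ratio(0, p0, p1)


def crossing_probability(p0, p1, alpha, n_max):
    """Exact null probability that the running product of likelihood ratios
    reaches 1/alpha at any of the first n_max observations."""
    up = likelihood_ratio(1, p0, p1)
    down = likelihood_ratio(0, p0, p1)
    # alive[k]: probability of k successes so far without having crossed the threshold
    alive = {0: 1.0}
    crossed = 0.0
    for n in range(1, n_max + 1):
        next_alive = {}
        for k, prob in alive.items():
            for x, px in ((1, p0), (0, 1 - p0)):
                k2 = k + x
                # The product depends only on the counts of successes and failures.
                value = up ** k2 * down ** (n - k2)
                if value >= 1 / alpha:
                    crossed += prob * px  # stop and reject at this observation
                else:
                    next_alive[k2] = next_alive.get(k2, 0.0) + prob * px
        alive = next_alive
    return crossed


print(expectation_under_null(p0=0.5, p1=0.6))
print(crossing_probability(p0=0.5, p1=0.6, alpha=0.05, n_max=200))
```

The second number is the false positive rate of the rule "reject as soon as the product reaches 1/α", evaluated at every observation. It stays below α however often the data are inspected, which is the behaviour the term test martingale refers to.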
In addition to this intuitive interpretation, E-variables provide many mathematical benefits as well. Earlier we highlighted p-values, optional stopping, and optional continuation as a few problems with classical statistical testing. We proceed now by discussing these issues in the context of E-variables.
For situations in which effect sizes are unknown or for tests with nuisance parameters, GROW may be indeterminable. However, the optimal growth can be determined relative to the unknown parameter. An E-variable with this property is known as relative GROW. These concepts will be applied in the derivations of the safe t-statistic and the safe proportion test statistic.
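As a rough illustration of the growth-rate idea, assume the usual reading of GROW from [GHK23]: the E-variable that maximizes the expected logarithm of E under the alternative. In the toy setting sketched here (a single Bernoulli observation with a known null p0 and a known alternative p1, all values hypothetical), every likelihood ratio of Bernoulli(q) to Bernoulli(p0) is an E-variable, and the one with q = p1 grows fastest.

```python
from math import log


def expected_log_growth(q, p0, p1):
    """Expected log of the E-variable Bernoulli(q)/Bernoulli(p0) when the data follow p1."""
    return p1 * log(q / p0) + (1 - p1) * log((1 - q) / (1 - p0))


def best_q_on_grid(p0, p1, steps=99):
    """Grid search over q in {0.01, ..., 0.99} for the fastest-growing E-variable."""
    grid = [i / (steps + 1) for i in range(1, steps + 1)]
    return max(grid, key=lambda q: expected_log_growth(q, p0, p1))


p0, p1 = 0.5, 0.6  # hypothetical null and alternative success probabilities
print(best_q_on_grid(p0, p1))
print(expected_log_growth(p1, p0, p1))  # Kullback-Leibler divergence of p1 from p0
```

When the effect size is unknown, p1 is not available, so this maximization cannot be carried out directly. That is the situation in which the relative GROW criterion is used instead.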
While there exist E-variables that are not safe under optional stopping [GHK23], A/B testing uses fairly common statistical tests for which optional stopping E-variables are available. The first such test we’ll explore is the t-test, beginning with the theory behind the classical t-test.
3.2 Classical t-test
The t-statistic is converted to a p-value using the t-distribution with ν = n + m − 2 degrees of freedom, which is then used to make a decision about the hypothesis.
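The sample size for the t-test is determined by α, β, and the effect size δ. Before the data are collected, the effect size is unknown and must be estimated. After the test, the effect size can be calculated with Cohen's d, which represents the standardized difference between the group means,
$$d = \frac{\bar{x} - \bar{y}}{s_p},$$
where $\bar{x}$ and $\bar{y}$ are the sample means of the two groups and $s_p$ is the pooled standard deviation
$$s_p = \sqrt{\frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}},$$
with $s_x^2$ and $s_y^2$ the sample variances of the two groups.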
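The quantities in this subsection can be computed in a few lines. The sketch below assumes the standard pooled-variance (equal-variance) two-sample t-statistic and the textbook definition of Cohen's d. The function names, the sign convention (first group minus second), and the sample data are illustrative and not taken from the paper.

```python
from math import sqrt
from statistics import mean, variance


def pooled_std(x, y):
    """Pooled standard deviation of two samples of sizes n and m."""
    n, m = len(x), len(y)
    return sqrt(((n - 1) * variance(x) + (m - 1) * variance(y)) / (n + m - 2))


def cohens_d(x, y):
    """Standardized difference between the two group means."""
    return (mean(x) - mean(y)) / pooled_std(x, y)


def t_statistic(x, y):
    """Pooled-variance two-sample t-statistic."""
    n, m = len(x), len(y)
    return (mean(x) - mean(y)) / (pooled_std(x, y) * sqrt(1 / n + 1 / m))


def degrees_of_freedom(x, y):
    """nu = n + m - 2, as used for the t-distribution."""
    return len(x) + len(y) - 2


control = [1, 2, 3, 4, 5]    # made-up observations
treatment = [2, 3, 4, 5, 6]  # made-up observations
print(t_statistic(control, treatment), degrees_of_freedom(control, treatment))
print(cohens_d(control, treatment))
```

Converting the t-statistic to a p-value requires the t-distribution's cumulative distribution function, which the Python standard library does not provide, so that step is left out here.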
3.3 Safe t-test
The one-sided safe t-test statistic has been shown to be GROW and the two-sided test statistic to be relative GROW [Pér+22]. Next, we discuss the χ2 test and its safe alternative.
3.4 χ2-test
The χ2 test is a classical statistical test that is used to assess the distribution of contingency table cells. A contingency table contains the frequencies of the multinomial data, allowing one to assess the similarities of the two distributions’ parameters. In the case of binomial data, the contingency table is 2×2, which will be the focus of this section.
The χ2 statistic is converted to a p-value using the χ2 distribution with (r − 1)(c − 1) degrees of freedom, where r and c are the number of rows and columns in the table. As with the classical t-test, the χ2 test is not safe under optional stopping, and thus peeking can inflate its false positive rate [Xu+22]. For this reason, safe alternatives that allow anytime-valid inference have been developed, which we will explore now.
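The effect of peeking on the classical test can be sketched with a small simulation. The code below assumes the Pearson χ2 statistic without continuity correction on a 2×2 table, and uses the fact that for one degree of freedom the upper tail of the χ2 distribution equals erfc(√(x/2)). The batch size, number of looks, and success probability are arbitrary illustrative choices, and both groups are drawn from the same distribution so that every rejection is a false positive.

```python
import random
from math import erfc, sqrt


def chi2_2x2(a_success, a_total, b_success, b_total):
    """Pearson chi-squared statistic (no continuity correction) for a 2x2 table."""
    table = [[a_success, a_total - a_success],
             [b_success, b_total - b_success]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                return 0.0  # degenerate table: treat as no evidence
            stat += (table[i][j] - expected) ** 2 / expected
    return stat


def p_value_1df(stat):
    """Upper tail of the chi-squared distribution with (2-1)(2-1) = 1 degree of freedom."""
    return erfc(sqrt(stat / 2))


def simulate(n_runs=500, n_looks=10, batch=40, p=0.3, alpha=0.05, seed=0):
    """Return (false positive rate when testing at every look, rate at the final look only)."""
    rng = random.Random(seed)
    peeking_rejections = 0
    final_rejections = 0
    for _run in range(n_runs):
        a = b = n = 0
        rejected_at_any_look = False
        significant = False
        for _look in range(n_looks):
            a += sum(rng.random() < p for _ in range(batch))
            b += sum(rng.random() < p for _ in range(batch))
            n += batch
            significant = p_value_1df(chi2_2x2(a, n, b, n)) < alpha
            rejected_at_any_look = rejected_at_any_look or significant
        peeking_rejections += rejected_at_any_look
        final_rejections += significant  # decision of the last look only
    return peeking_rejections / n_runs, final_rejections / n_runs


peeking_rate, final_rate = simulate()
print(f"peeking: {peeking_rate:.3f}, single final test: {final_rate:.3f}")
```

Because the peeking rule rejects whenever any look is significant, its rejection rate can never be lower than that of the single final test on the same data. The simulation gives a sense of how far above α it climbs.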
3.5 Safe Proportion Test
Next, consider the quantity
and n1 = na1 + nb1. Under H1 the joint distribution is
Combining Equations 6, 7, and 8 and simplifying (see [TLG22] for details) gives the final expression for the relative GROW E-variable of batch size na + nb:
In the next section, we compare the safe t-test and the safe proportion test to their classical alternatives.
Author:
(1) Daniel Beasley
This paper is