How Safe Tests Reduce Sample Sizes Without Compromising Statistical Validity


Table of Links

  1. Introduction

  2. Hypothesis testing

    2.1 Introduction

    2.2 Bayesian statistics

    2.3 Test martingales

    2.4 p-values

    2.5 Optional Stopping and Peeking

    2.6 Combining p-values and Optional Continuation

    2.7 A/B testing

  3. Safe Tests

    3.1 Introduction

    3.2 Classical t-test

    3.3 Safe t-test

    3.4 χ2-test

    3.5 Safe Proportion Test

  4. Safe Testing Simulations

    4.1 Introduction and 4.2 Python Implementation

    4.3 Comparing the t-test with the Safe t-test

    4.4 Comparing the χ2-test with the safe proportion test

  5. Mixture sequential probability ratio test

    5.1 Sequential Testing

    5.2 Mixture SPRT

    5.3 mSPRT and the safe t-test

  6. Online Controlled Experiments

    6.1 Safe t-test on OCE datasets

  7. Vinted A/B tests and 7.1 Safe t-test for Vinted A/B tests

    7.2 Safe proportion test for sample ratio mismatch

  8. Conclusion and References

4 Safe Testing Simulations

4.1 Introduction

In this section, we compare the classical t-test with the safe t-test, and the χ2 test with the safe proportion test. A thorough library for safe testing has been developed in R [LTT20]. With the goal of increasing adoption in the field of data science, we ported the code for the safe t-test and the safe proportion test into Python.

4.2 Python Implementation

While the logic of the safe t-test remains the same as in the R implementation, there were a number of inefficiencies in the original code that needed to be addressed in order to work with large sample sizes. The improvements are detailed here.

The first improvement comes in determining the sample size required for a batch process of the data. The original function performs a linear search from 1 to an arbitrarily high number. For each candidate sample size in the range, the function calculates the E-value from the sample sizes, degrees of freedom, and effect size, and the loop breaks when the E-value exceeds 1/α. Since the E-value is monotonically increasing in the sample size, a binary search speeds up the calculation considerably, reducing the computational complexity from O(n) to O(log n). This optimization proved necessary when working with millions of samples.
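
The search strategy can be illustrated with a minimal sketch. The helper e_value_at below is hypothetical and stands in for the safe t-test's E-value calculation from sample size, degrees of freedom, and effect size; only the replacement of the linear scan by a binary search is the point.

    def plan_sample_size(delta, alpha, n_max=10_000_000):
        """Smallest n whose E-value reaches 1/alpha, found by binary search."""
        lo, hi = 1, n_max
        while lo < hi:
            mid = (lo + hi) // 2
            if e_value_at(mid, delta) >= 1.0 / alpha:  # hypothetical helper
                hi = mid          # mid is large enough; try a smaller n
            else:
                lo = mid + 1      # mid is too small; search the upper half
        return lo                 # O(log n_max) evaluations instead of O(n_max)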

The next necessary speed improvement concerns calculating the stopping time for a power of 1 − β. This is determined by simulating data that differ by the minimal effect size. Over the course of N simulations, data of length m are streamed individually to determine the point at which the E-value crosses 1/α. Once again, this process is done through a linear search. To optimize this function, the calculation of the martingale is parallelized over the whole vector of length m. The computational complexity remains O(Nm), but the vector computation takes place in NumPy, as opposed to a Python loop. NumPy's vectorized operations are implemented in C, hence the calculation is much faster.
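
A sketch of the vectorized computation, assuming a hypothetical helper martingale_increments that returns the per-observation factors of the test martingale for one simulated stream, is:

    import numpy as np

    def stopping_time(x, delta, alpha):
        """Index at which the running E-value first crosses 1/alpha, or -1 if never."""
        # Running product of martingale factors, computed in C rather than a Python loop.
        e_process = np.cumprod(martingale_increments(x, delta))  # hypothetical helper
        crossed = e_process >= 1.0 / alpha
        return int(np.argmax(crossed)) if crossed.any() else -1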

The final modification does not reduce computational complexity but extends the capabilities of the safe proportion test. This test was written in R as a two-sample test with fixed batch sizes. For our use case, a one-sample test with variable batch sizes was required to detect sample ratio mismatch, and it was therefore developed for the Python package.
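
As an illustration of that one-sample use case, the sketch below streams variable-sized batches and flags a sample ratio mismatch once the running E-value crosses 1/α. It uses a Beta-Binomial Bayes factor against H0 : p = 0.5, which is one valid choice of test martingale; the prior parameters and function names are illustrative rather than the package's actual API.

    from scipy.stats import betabinom, binom

    def srm_check(batch_counts_a, batch_sizes, alpha=0.05, p0=0.5, a=1.0, b=1.0):
        """Flag a sample ratio mismatch as soon as the running E-value reaches 1/alpha."""
        k_tot = n_tot = 0
        for k, n in zip(batch_counts_a, batch_sizes):   # batches may differ in size
            k_tot, n_tot = k_tot + k, n_tot + n
            # Bayes factor: Beta-Binomial marginal likelihood over the Binomial null likelihood.
            e = betabinom.pmf(k_tot, n_tot, a, b) / binom.pmf(k_tot, n_tot, p0)
            if e >= 1.0 / alpha:
                return True, n_tot                      # mismatch flagged after n_tot units
        return False, n_tot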

4.3 Comparing the t-test with the Safe t-test

The most straightforward way to understand the safe t-test is to compare it with its classical alternative. We perform simulations with an effect size δ, testing the null hypothesis H0 : δ = 0. Setting the significance level α = 0.05, we simulate an effect size δ between two groups to determine when the test is stopped. If the simulated E-value crosses 1/α = 20, the test is stopped and H0 is rejected. If no effect is detected, the test is stopped at the sample size corresponding to a power of 1 − β = 0.8, as this power is common within industry. Figure 3 shows simulations of stopping times and decisions of the safe test compared to the t-test.
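
A single run of this simulation can be sketched as follows; safe_t_e_process is a hypothetical helper returning the running E-value after each paired observation, and n_power stands for the sample size giving 1 − β = 0.8 power.

    import numpy as np

    def simulate_stopping_time(delta, n_power, alpha=0.05, seed=0):
        """Stopping time and decision for one simulated A/B test with effect size delta."""
        rng = np.random.default_rng(seed)
        a = rng.normal(0.0, 1.0, n_power)      # control group
        b = rng.normal(delta, 1.0, n_power)    # treatment group, shifted by delta
        e = safe_t_e_process(a, b)             # hypothetical: running E-values, length n_power
        crossed = e >= 1.0 / alpha             # threshold 1/alpha = 20
        if crossed.any():
            return int(np.argmax(crossed)) + 1, "reject H0"
        return n_power, "accept H0"            # otherwise stop at the 80%-power sample size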

Figure 3: Histogram of stopping times for the safe t-test. The solid vertical lines show the average stopping times for the safe t-test and the classical t-test.

As we can see from the average stopping times in Figure 3, the safe t-test uses fewer than 500,000 samples on average to deliver statistically valid results, while the classical t-test requires over 600,000. However, the sample size required to reach 1 − β power for the safe t-test is approximately 850,000, much larger than that of the classical t-test. One may ask whether it is acceptable to simply run the safe t-test only until the classical t-test's sample size is reached. Figure 4 (left) shows the impact of this choice on the statistical errors. By the completion of the test, both the classical t-test and the safe t-test meet the requirement that Type I errors are below α = 0.05 and Type II errors are below β = 0.2. However, combining the two tests results in an inflated Type I error rate, and hence will not meet the experimenter's expected level of statistical significance. Given the savings in test duration, there may be motivation to develop methods to combine these tests in the future such that the false positive rate remains below α, for example using the Bonferroni correction.
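
One simple version of such a correction, sketched under the assumption that each test is run at level α/2, would be:

    def combined_decision(safe_e_value, classical_p_value, alpha=0.05):
        """Bonferroni-style combination: each test at level alpha/2, so the
        overall false-positive rate stays below alpha (illustrative only)."""
        return safe_e_value >= 2.0 / alpha or classical_p_value <= alpha / 2.0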

As well as the overall conclusions of the two tests, it is interesting to consider the experiments for which the classical t-test and the safe t-test disagree. As seen in Figure 4 (right), while both tests achieve 80% power, they do so in very different ways. Many simulations for which the classical t-test accepts H0 are rejected by the safe t-test, and vice versa. This difference in outcomes will likely be difficult to internalize for practitioners who consider the t-test to be the source of truth for their platform.

Figure 4: Statistical error rates for both the safe and classical t-tests, and the results from combining their decisions (left); decisions of the safe and classical t-tests on 1000 simulations (right).

While Figure 3 evaluates safe stopping times for a fixed effect size, it is important to consider the results for a wide range of effect sizes. To aggregate the results of effect sizes from 0.01 to 0.3, we normalize the stopping times by the t-test stopping time. The results of this analysis can be seen in Figure 5.

Figure 5: Density histogram of the ratio of safe t-test sample sizes to the classical t-test.

Figure 5 shows both the average stopping point of the safe t-test and the sample size required for 80% power. On average, the safe test uses 18% less data than the t-test. To achieve the same power of 80%, however, the safe test uses 36% more data. Given that most A/B tests do not result in the rejection of H0 [Aze+20], this could result in longer experiments overall for practitioners.

4.4 Comparing the χ2-test with the safe proportion test

Figure 6: Histogram of stopping times for the safe proportion test. The solid vertical lines show the average stopping times for the safe proportion test and the χ2 test.

The results in Figure 6 are remarkably similar to those from the comparison of the t-test and the safe t-test in Figure 3. The safe test again uses fewer samples, on average, than its classical alternative, while the maximum stopping time needed to achieve the required power is higher. Next, we consider the sample sizes of the tests as a function of the difference ϵ. Figure 7 shows both the average and maximum stopping times for ϵ ∈ [0.001, 0.1].

As seen in Figure 7, the average sample size required for the safe proportion test is less than that of the χ2 test for all values of ϵ. This suggests that the safe proportion test will be competitive with the χ2 test, even for detecting small effects. Looking at these results, one may question whether it is appropriate to set a prior based on an unknown effect size. However, the prior can be based on the effect size calculated from the data observed so far, updated after each sample. Because each update uses only past observations, setting the priors based on the current effect size has no impact on the validity of the test.
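
The sketch below makes this concrete: before each new batch is scored, the effect size handed to the E-value update is estimated from previous batches only, so the resulting product of factors is still a test martingale under H0. Both helpers are hypothetical names standing in for the package's internals.

    def e_process_with_running_prior(batches_a, batches_b, alpha=0.05):
        """Safe test whose prior effect size is re-estimated from past data only."""
        e, seen_a, seen_b = 1.0, [], []
        for a, b in zip(batches_a, batches_b):
            delta_hat = estimate_effect(seen_a, seen_b)  # hypothetical: past batches only,
                                                         # with a default when none seen yet
            e *= e_value_increment(a, b, delta_hat)      # hypothetical: scores the new batch
            seen_a.append(a)
            seen_b.append(b)
            if e >= 1.0 / alpha:
                break                                    # optional stopping remains valid
        return e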

In this section, we have compared the safe t-test and the safe proportion test with their classical alternatives. It was found that average sample sizes for the safe t-test are smaller than those of the classical t-test for a wide range of effect sizes. However, the maximum sample size can be much greater to achieve the same statistical power. Additionally, the average sample sizes of the safe proportion test are smaller than those of the χ2 test. These findings motivate further adoption of safe tests in scientific endeavours. In the next section, we compare the safe t-test to another anytime-valid test used in industry, the mixture sequential probability ratio test.

Author:

(1) Daniel Beasley


This paper is available on arXiv under the CC BY-NC-SA 4.0 (Attribution-NonCommercial-ShareAlike 4.0 International) license.
