How E-Variables Prevent False Positive Inflation | HackerNoon

Table of Links

Introduction
Hypothesis testing

2.1 Introduction

2.2 Bayesian statistics

2.3 Test martingales

2.4 p-values

2.5 Optional Stopping and Peeking

2.6 Combining p-values and Optional Continuation

2.7 A/B testing
Safe Tests

3.1 Introduction

3.2 Classical t-test

3.3 Safe t-test

3.4 χ²-test

3.5 Safe Proportion Test
Safe Testing Simulations

4.1 Introduction and 4.2 Python Implementation

4.3 Comparing the t-test with the Safe t-test

4.4 Comparing the χ²-test with the safe proportion test
Mixture sequential probability ratio test

5.1 Sequential Testing

5.2 Mixture SPRT

5.3 mSPRT and the safe t-test
Online Controlled Experiments

6.1 Safe t-test on OCE datasets
Vinted A/B tests and 7.1 Safe t-test for Vinted A/B tests

7.2 Safe proportion test for sample ratio mismatch
Conclusion and References

3 Safe Tests

3.1 Introduction

Safe testing [GHK23] is a novel method of hypothesis testing developed to address many issues with modern statistical inference. The "safe" in safe testing refers to the fact that the false positive rate does not rise above α in the optional continuation setting. As we will see, many safe tests also allow for optional stopping [GHK23], including the safe t-test and the safe proportion test that we will apply. Figure 2 shows how the false positive rates of the classical t-test and the safe t-test change over the course of an experiment.

Figure 2: False positive probability for the classical t-test and the safe t-test.
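The behaviour summarized in Figure 2 can be reproduced in miniature. The following is a minimal sketch, not the paper's simulation: it assumes unit-variance Gaussian data with a true mean of zero, substitutes a two-sided z-test for the classical t-test, and uses a simple likelihood ratio against a hypothetical alternative mean `mu1` as the E-variable. All function names and parameter values are illustrative.

```python
import math
import random


def z_test_p_value(total, n):
    """Two-sided p-value for H0: mean = 0, assuming unit-variance Gaussian data."""
    z = total / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def e_value(total, n, mu1):
    """Likelihood ratio of N(mu1, 1) against N(0, 1) after n observations."""
    return math.exp(mu1 * total - n * mu1 ** 2 / 2)


def simulate(n_experiments=2000, n_max=200, alpha=0.05, mu1=0.3, seed=0):
    """Peek after every observation and count experiments that ever reject H0."""
    rng = random.Random(seed)
    classical_rejections = 0
    safe_rejections = 0
    for _ in range(n_experiments):
        total = 0.0
        classical_hit = False
        safe_hit = False
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)  # data are generated under H0
            if z_test_p_value(total, n) < alpha:
                classical_hit = True
            if e_value(total, n, mu1) >= 1 / alpha:
                safe_hit = True
        classical_rejections += int(classical_hit)
        safe_rejections += int(safe_hit)
    return classical_rejections / n_experiments, safe_rejections / n_experiments


if __name__ == "__main__":
    classical_rate, safe_rate = simulate()
    print(f"classical false positive rate with peeking: {classical_rate:.3f}")
    print(f"safe false positive rate with peeking:      {safe_rate:.3f}")
```

Under these assumptions the peeking z-test rejects far more often than α, while the likelihood-ratio test stays at or below α, which is the qualitative pattern the figure describes.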

Safe testing is based on E-variables (or E-test statistics), which are non-negative random variables E that satisfy

$$\mathbb{E}_P[E] \le 1 \quad \text{for all } P \in H_0.$$

Under the null hypothesis, many E-variables behave as test martingales [GHK23], which are closely related to Bayes factors.
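The defining property of an E-variable can be checked exactly in a small case. The sketch below assumes a simple Bernoulli null with success probability `p0` and a simple alternative `p1` (both hypothetical values chosen for illustration) and uses the likelihood ratio as the E-variable. It computes the expectation under the null by summing over all possible success counts, and also the exact probability that the E-variable reaches 1/α, which Markov's inequality bounds by α.

```python
from math import comb


def likelihood_ratio(k, n, p0, p1):
    """E-value for k successes in n Bernoulli trials: P1(data) / P0(data)."""
    return (p1 ** k * (1 - p1) ** (n - k)) / (p0 ** k * (1 - p0) ** (n - k))


def null_expectation(n, p0, p1):
    """Exact expected value of the likelihood ratio when the data follow P0."""
    return sum(
        comb(n, k) * p0 ** k * (1 - p0) ** (n - k) * likelihood_ratio(k, n, p0, p1)
        for k in range(n + 1)
    )


def null_rejection_probability(n, p0, p1, alpha):
    """Exact P0(E >= 1/alpha) at a fixed sample size n."""
    return sum(
        comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
        for k in range(n + 1)
        if likelihood_ratio(k, n, p0, p1) >= 1 / alpha
    )


if __name__ == "__main__":
    print(null_expectation(20, 0.5, 0.6))
    print(null_rejection_probability(50, 0.5, 0.7, 0.05))
```

The expectation equals one here (up to floating-point error) because the likelihood ratio is taken against a simple null; for composite nulls the inequality can be strict.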

In addition to this intuitive interpretation, E-variables provide many mathematical benefits as well. Earlier we highlighted p-values, optional stopping, and optional continuation as a few problems with classical statistical testing. We proceed now by discussing these issues in the context of E-variables.

For situations in which effect sizes are unknown or for tests with nuisance parameters, the GROW (growth-rate optimal in the worst case) E-variable may be indeterminable. However, optimal growth can still be determined relative to the unknown parameter. An E-variable with this property is known as relative GROW. These concepts will be applied in the derivations of the safe t-statistic and the safe proportion test statistic.
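To make the growth criterion concrete, here is a minimal sketch for a single Bernoulli outcome, assuming a simple null `p_null` and a simple true alternative `q_true` (hypothetical names and values). It evaluates the expected logarithm of the E-variable under the alternative for several candidate bets and shows that betting with the true alternative grows fastest. The `growth_shortfall` helper is an illustrative stand-in for the quantity that a relative criterion compares against an oracle that knows the parameter; it is not taken from the source.

```python
import math


def expected_log_growth(q_true, q_bet, p_null):
    """E_Q[log E] for one Bernoulli outcome, where E = q_bet(x) / p_null(x)."""
    return (
        q_true * math.log(q_bet / p_null)
        + (1 - q_true) * math.log((1 - q_bet) / (1 - p_null))
    )


def best_bet(q_true, p_null, candidates):
    """Candidate alternative that maximizes the expected log growth under q_true."""
    return max(candidates, key=lambda q: expected_log_growth(q_true, q, p_null))


def growth_shortfall(q_true, q_bet, p_null):
    """Growth lost relative to an oracle that bets with the true parameter."""
    oracle = expected_log_growth(q_true, q_true, p_null)
    return oracle - expected_log_growth(q_true, q_bet, p_null)


if __name__ == "__main__":
    grid = [i / 100 for i in range(1, 100)]
    print(best_bet(0.6, 0.5, grid))
    print(growth_shortfall(0.6, 0.55, 0.5))
```

In this simple-versus-simple setting the growth-optimal E-variable is the likelihood ratio itself, and its expected log growth equals the Kullback-Leibler divergence between alternative and null.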

While there exist E-variables that are not safe under optional stopping [GHK23], A/B testing uses fairly common statistical tests for which optional stopping E-variables are available. The first such test we’ll explore is the t-test, beginning with the theory behind the classical t-test.

3.2 Classical t-test

The t-statistic is converted to a p-value using the t-distribution with ν = n+m−2 degrees of freedom,

which is then used to make a decision about the hypothesis.

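The sample size for the t-test is determined by α, β, and the effect size δ. Before the data are collected, the effect size is unknown and must be estimated. After the test, the effect size can be calculated with Cohen's d, which represents the overall difference between the groups in units of their common standard deviation,

$$d = \frac{\bar{x} - \bar{y}}{s_p},$$

where $\bar{x}$ and $\bar{y}$ are the sample means of the two groups (of sizes n and m) and $s_p$ is the pooled standard deviation

$$s_p = \sqrt{\frac{(n-1)\,s_x^2 + (m-1)\,s_y^2}{n+m-2}},$$

with $s_x^2$ and $s_y^2$ the sample variances of the two groups.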
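The quantities in this section can be computed with the Python standard library alone. The sketch below uses the standard textbook definitions of the pooled standard deviation, Cohen's d, and the equal-variance two-sample t-statistic; the function names are illustrative. The `approx_sample_size` helper is an assumption on my part: it uses the common normal approximation for a two-sided test with equal group sizes rather than an exact t-based power calculation.

```python
import math
from statistics import NormalDist, mean, variance


def pooled_std(x, y):
    """Pooled standard deviation of two samples with n + m - 2 degrees of freedom."""
    n, m = len(x), len(y)
    return math.sqrt(((n - 1) * variance(x) + (m - 1) * variance(y)) / (n + m - 2))


def cohens_d(x, y):
    """Difference in sample means in units of the pooled standard deviation."""
    return (mean(x) - mean(y)) / pooled_std(x, y)


def t_statistic(x, y):
    """Equal-variance two-sample t-statistic, written in terms of Cohen's d."""
    n, m = len(x), len(y)
    return cohens_d(x, y) / math.sqrt(1 / n + 1 / m)


def approx_sample_size(delta, alpha=0.05, beta=0.2):
    """Approximate per-group sample size (normal approximation, two-sided test)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(1 - beta)
    return math.ceil(2 * z_sum ** 2 / delta ** 2)


if __name__ == "__main__":
    control = [1, 2, 3, 4, 5]
    treatment = [2, 3, 4, 5, 6]
    print(cohens_d(control, treatment), t_statistic(control, treatment))
    print(approx_sample_size(0.5))
```

Writing the t-statistic as d divided by sqrt(1/n + 1/m) makes explicit why a smaller effect size requires a larger sample to reach the same level of evidence.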

3.3 Safe t-test

The one-sided safe t-test statistic has been shown to be GROW and the two-sided test statistic to be relative GROW [Pér+22]. Next, we discuss the χ² test and its safe alternative.

3.4 χ²-test

The χ² test is a classical statistical test that is used to assess the distribution of contingency table cells. A contingency table contains the frequencies of the multinomial data, allowing one to assess the similarities of the two distributions’ parameters. In the case of binomial data, the contingency table is 2×2, which will be the focus of this section.

The χ² statistic is converted to a p-value using the χ² distribution with (r − 1)(c − 1) degrees of freedom, where r and c are the number of rows and columns in the table. As with the classical t-test, the χ² test is not safe under optional stopping, and thus peeking can inflate its false positive rate [Xu+22]. For this reason, safe alternatives that allow anytime-valid inference have been developed, which we will explore now.
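For the 2×2 case the classical computation is short enough to write out. The sketch below assumes the Pearson form of the statistic without a continuity correction (the source does not say which variant it uses) and relies on the fact that, with (2 − 1)(2 − 1) = 1 degree of freedom, the upper tail of the χ² distribution can be written with the complementary error function. Function names and the example counts are illustrative.

```python
import math


def chi2_2x2(table):
    """Pearson chi-squared statistic (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row_total in enumerate(row_totals):
        for j, col_total in enumerate(col_totals):
            expected = row_total * col_total / n  # expected count under independence
            stat += (table[i][j] - expected) ** 2 / expected
    return stat


def chi2_p_value_1df(stat):
    """Upper-tail probability of a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


if __name__ == "__main__":
    # Rows: groups a and b; columns: successes and failures (made-up counts).
    counts = [[10, 20], [20, 10]]
    stat = chi2_2x2(counts)
    print(stat, chi2_p_value_1df(stat))
```

This p-value is valid only at a sample size fixed in advance; recomputing it after each new batch and stopping at the first value below α is the peeking behaviour described above.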

3.5 Safe Proportion Test

Next, consider the quantity

and n1 = na1 + nb1. Under H1 the joint distribution is

Combining equations (6), (7), and (8) and simplifying (see [TLG22] for details) gives the final expression for the relative GROW E-variable of batch size na + nb:
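As a rough illustration of how a batch E-variable for two Bernoulli streams can be computed, consider the simplified sketch below. It is an assumption-laden stand-in, not the expression derived in this section: it fixes hand-picked alternative success rates `theta_a` and `theta_b` (hypothetical parameters) instead of learning them from data, and it measures each batch against a single null rate `theta_0` taken as the sample-size-weighted average of the two. The `null_expectation` helper enumerates every possible batch to check numerically that the expected value under a shared success rate does not exceed one.

```python
from itertools import product


def bernoulli_pmf(y, theta):
    """Probability of outcome y (0 or 1) under success rate theta."""
    return theta if y == 1 else 1 - theta


def batch_e_value(ya, yb, theta_a, theta_b):
    """E-value for one batch; ya and yb hold the 0/1 outcomes of groups a and b."""
    na, nb = len(ya), len(yb)
    theta_0 = (na * theta_a + nb * theta_b) / (na + nb)  # weighted null rate
    e = 1.0
    for y in ya:
        e *= bernoulli_pmf(y, theta_a) / bernoulli_pmf(y, theta_0)
    for y in yb:
        e *= bernoulli_pmf(y, theta_b) / bernoulli_pmf(y, theta_0)
    return e


def null_expectation(na, nb, theta_a, theta_b, theta_null):
    """Exact expectation of the batch E-value when both groups share theta_null."""
    total = 0.0
    for outcome in product((0, 1), repeat=na + nb):
        prob = 1.0
        for y in outcome:
            prob *= bernoulli_pmf(y, theta_null)
        total += prob * batch_e_value(outcome[:na], outcome[na:], theta_a, theta_b)
    return total


if __name__ == "__main__":
    print(batch_e_value([1, 0], [1, 1, 0], theta_a=0.3, theta_b=0.5))
    print(max(null_expectation(2, 3, 0.3, 0.5, t / 20) for t in range(1, 20)))
```

Because each batch yields a valid E-value, the values from successive batches can be multiplied, and the running product compared with 1/α after every batch.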

In the next section, we compare the safe t-test and the safe proportion test to their classical alternatives.

Author:

(1) Daniel Beasley

This paper is available on arXiv under the Attribution-NonCommercial-ShareAlike 4.0 International license.
