Software packages usually report the results of statistical tests using p-values. Users often interpret these by comparing them to standard thresholds, e.g. 0.1%, 1% and 5%, which is sometimes reinforced by a star rating (***, **, *). In this article, we consider an arbitrary statistical test whose p-value p is not available explicitly, but can be approximated by Monte Carlo samples, e.g. by bootstrap or permutation tests. The standard implementation of such tests usually draws a fixed number of samples to approximate p. However, the probability that the exact and the approximated p-value lie on different sides of a threshold (the resampling risk) can be high, particularly for p-values close to a threshold. We present a method to overcome this. We consider a finite set of user-specified intervals which cover [0,1] and which can be overlapping. We call these p-value buckets. We present algorithms that, with arbitrarily high probability, return a p-value bucket containing p. We prove that for both a bounded resampling risk and a finite runtime, overlapping buckets need to be employed, and that our methods both bound the resampling risk and guarantee a finite runtime for such overlapping buckets. To interpret decisions with overlapping buckets, we propose an extension of the star rating system. We demonstrate that our methods are suitable for use in standard software, including for low p-values occurring in multiple testing settings, and that they can be computationally more efficient than standard implementations.
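To make the notion of resampling risk concrete, the following minimal sketch computes it exactly for a fixed-sample-size Monte Carlo test. It is an illustration only, not the method proposed in the article. It assumes that the p-value is estimated as (X + 1)/(n + 1), where X is the number of the n simulated statistics that are at least as extreme as the observed one (so X is binomial with parameters n and p), and that a p-value counts as significant when it is at most the threshold. The function names `binom_pmf` and `resampling_risk` are hypothetical.

```python
from math import exp, lgamma, log


def binom_pmf(x, n, p):
    """Binomial(n, p) probability of x successes, computed in log space.

    Requires 0 < p < 1 (assumption of this sketch).
    """
    log_coef = lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
    return exp(log_coef + x * log(p) + (n - x) * log(1 - p))


def resampling_risk(p, n, threshold):
    """Probability that the estimate and the true p-value p lie on
    different sides of the threshold when n Monte Carlo samples are drawn.

    Assumed estimator: p_hat = (x + 1) / (n + 1).
    Assumed decision rule: significant if the p-value is <= threshold.
    """
    truly_significant = p <= threshold
    risk = 0.0
    for x in range(n + 1):
        p_hat = (x + 1) / (n + 1)
        # Add the probability of every outcome whose decision disagrees
        # with the decision based on the true p-value.
        if (p_hat <= threshold) != truly_significant:
            risk += binom_pmf(x, n, p)
    return risk


if __name__ == "__main__":
    # A p-value just below the 5% threshold versus one far away from it.
    for p in (0.049, 0.5):
        print(p, resampling_risk(p, n=1000, threshold=0.05))
```

Under these assumptions, the risk for a p-value just below the threshold stays large even with 1000 samples, while it is negligible for a p-value far from the threshold, which is the behaviour the fixed-sample-size approach cannot avoid.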