Annual Conference on Neural Information Processing Systems

Optimal kernel choice for large-scale two-sample tests

Abstract

Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.
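
The linear-cost procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a Gaussian kernel, a statistic formed by averaging a kernel-based term over non-overlapping pairs of observations, a normal approximation for the test threshold, and a simple search over a finite list of candidate bandwidths that maximizes the ratio of the statistic to its estimated standard deviation. The function names (`linear_time_mmd`, `select_bandwidth`, `two_sample_test`) and all parameters are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel evaluated row by row on paired arrays a[i], b[i]."""
    sq_dist = np.sum((a - b) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def linear_time_mmd(x, y, bandwidth):
    """One-pass MMD estimate from non-overlapping sample pairs.

    Returns (statistic, sample std of the per-pair terms, number of terms).
    """
    m = (min(len(x), len(y)) // 2) * 2  # use an even number of samples
    x1, x2 = x[0:m:2], x[1:m:2]
    y1, y2 = y[0:m:2], y[1:m:2]
    h = (gaussian_kernel(x1, x2, bandwidth)
         + gaussian_kernel(y1, y2, bandwidth)
         - gaussian_kernel(x1, y2, bandwidth)
         - gaussian_kernel(x2, y1, bandwidth))
    return h.mean(), h.std(ddof=1), len(h)


def select_bandwidth(x, y, candidates, eps=1e-8):
    """Pick the candidate with the largest statistic-to-std ratio.

    Assumption: a finite candidate list stands in for a full optimization
    over kernel parameters.
    """
    def ratio(bw):
        stat, sigma, _ = linear_time_mmd(x, y, bw)
        return stat / (sigma + eps)

    return max(candidates, key=ratio)


def two_sample_test(x, y, bandwidth, alpha=0.05):
    """Reject p = q when the statistic exceeds a normal-approximation threshold."""
    stat, sigma, n = linear_time_mmd(x, y, bandwidth)
    threshold = NormalDist().inv_cdf(1.0 - alpha) * sigma / np.sqrt(n)
    return bool(stat > threshold), stat, threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=(4000, 2))  # samples from p
    y = rng.normal(1.0, 1.0, size=(4000, 2))  # samples from q (shifted mean)
    # Choose the kernel on the first half, then test on the held-out half.
    bw = select_bandwidth(x[:2000], y[:2000], [0.25, 0.5, 1.0, 2.0, 4.0])
    reject, stat, threshold = two_sample_test(x[2000:], y[2000:], bw)
    print(bw, reject, stat, threshold)
```

In this sketch the kernel is selected on one portion of the data and the test is run on a held-out portion, so that the selection step does not influence the test level. Each function makes a single pass over its input, which is what keeps the cost linear in the sample size.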
