Journal: IEEE Transactions on Information Theory

Hypothesis Testing in High-Dimensional Regression Under the Gaussian Random Design Model: Asymptotic Theory



Abstract

We consider linear regression in the high-dimensional regime where the number of observations \(n\) is smaller than the number of parameters \(p\). A very successful approach in this setting uses \(\ell_1\)-penalized least squares (also known as the Lasso) to search for a subset of \(s_0 < n\) parameters that best explain the data, while setting the other parameters to zero. A considerable amount of work has been devoted to characterizing the estimation and model selection problems within this approach. In this paper, we consider instead the fundamental, but far less understood, question of statistical significance. More precisely, we address the problem of computing p-values for single regression coefficients. On the one hand, we develop a general upper bound on the minimax power of tests with a given significance level. We show that rigorous guarantees for earlier methods do not allow one to achieve this bound, except in special cases. On the other hand, we prove that this upper bound is (nearly) achievable through a practical procedure in the case of random design matrices with independent entries. Our approach is based on a debiasing of the Lasso estimator. The analysis builds on a rigorous characterization of the asymptotic distribution of the Lasso estimator and its debiased version. Our result holds for optimal sample size, i.e., when \(n\) is at least on the order of \(s_0 \log(p/s_0)\). We generalize our approach to random design matrices with independent identically distributed Gaussian rows \(\boldsymbol{x}_i \sim \mathsf{N}(0, \boldsymbol{\Sigma})\). In this case, we prove that a similar distributional characterization (termed standard distributional limit) holds for \(n\) much larger than \(s_0 (\log p)^2\). Our analysis assumes \(\boldsymbol{\Sigma}\) is known. To cope with unknown \(\boldsymbol{\Sigma}\), we suggest a plug-in estimator for sparse covariances \(\boldsymbol{\Sigma}\) and validate the method through numerical simulations.
Finally, we show that for optimal sample size, \(n\) being at least of order \(s_0 \log(p/s_0)\), the standard distributional limit for general Gaussian designs can be derived from the replica heuristics in statistical physics. This derivation suggests a stronger conjecture than the result we prove, and near-optimality of the statistical power for a large class of Gaussian designs.
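To make the idea of debiasing the Lasso and reading off per-coefficient p-values more concrete, here is a minimal Python sketch. It is not the paper's exact procedure: it uses the generic debiased form \(\hat{\theta}^u = \hat{\theta} + \tfrac{1}{n} M X^{\top}(y - X\hat{\theta})\) found in the wider debiased-Lasso literature, and it assumes a known covariance \(\boldsymbol{\Sigma} = I\) (so \(M = I\)), a known noise level `sigma`, and a simple proximal-gradient (ISTA) Lasso solver. The function names `lasso_ista` and `debiased_lasso_pvalues`, the regularization choice, and the simulation settings are all illustrative assumptions.

```python
import math
import numpy as np


def lasso_ista(X, y, lam, n_iter=500):
    """Approximately minimize (1/2n)||y - X theta||^2 + lam * ||theta||_1 via ISTA."""
    n, p = X.shape
    # Step size = 1 / Lipschitz constant of the smooth part's gradient.
    step = n / np.linalg.norm(X, 2) ** 2
    theta = np.zeros(p)
    for _ in range(n_iter):
        z = theta + step * (X.T @ (y - X @ theta)) / n
        # Soft-thresholding is the proximal operator of the l1 penalty.
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return theta


def debiased_lasso_pvalues(X, y, lam, sigma, M):
    """Debias a Lasso fit with matrix M and return two-sided normal p-values.

    Assumption: M approximates the inverse covariance of the rows of X,
    and sigma is the (known) noise standard deviation.
    """
    n, _ = X.shape
    theta_hat = lasso_ista(X, y, lam)
    theta_u = theta_hat + M @ (X.T @ (y - X @ theta_hat)) / n
    cov_hat = X.T @ X / n
    var = np.sum((M @ cov_hat) * M, axis=1)  # diagonal of M cov_hat M^T
    z = np.sqrt(n) * theta_u / (sigma * np.sqrt(var))
    # Two-sided p-value 2 * (1 - Phi(|z|)) written with the complementary error function.
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return theta_u, pvals


# Illustrative simulation (all settings are assumptions, not from the paper).
rng = np.random.default_rng(0)
n, p, s0, sigma = 200, 300, 5, 0.5
theta0 = np.zeros(p)
theta0[:s0] = 1.0
X = rng.standard_normal((n, p))  # i.i.d. rows with identity covariance
y = X @ theta0 + sigma * rng.standard_normal(n)
lam = sigma * np.sqrt(2.0 * np.log(p) / n)
theta_u, pvals = debiased_lasso_pvalues(X, y, lam, sigma, M=np.eye(p))
```

With an unknown covariance, `M` would be replaced by the inverse of some covariance estimate, in the spirit of the plug-in approach the abstract mentions; this sketch does not attempt that step, and its variance formula is only a first-order approximation when \(n\) is not much larger than the sparsity.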
