IEEE International Conference on Machine Learning and Applications

Resampling-Based Variable Selection with Lasso for p ≫ n and Partially Linear Models



Abstract

When proposing consistent variable selection methods for large, high-dimensional datasets, a linear model of the regression function is a widely used and, in most cases, highly unrealistic simplifying assumption. In this paper, we study, from a theoretical point of view, what happens when a variable selection method assumes a linear regression function while the underlying ground-truth model is composed of a linear and a non-linear term, i.e., is at most partially linear. We demonstrate the consistency of the Lasso method when the model is partially linear. However, we note that, given few training samples, the algorithm tends to select even more false positives on partially linear models. This is usually because the values of small groups of samples happen to explain, through a linear combination of wrong predictors, variation coming from the non-linear part of the response function and from the noise. We show theoretically that the Lasso is likely to select false positives because of small proportions of samples that happen to explain some of the variation in the response variable. This property implies that if we run the Lasso on several slightly smaller replications of the data, sampled without replacement, and intersect the selected variable sets, we are likely to reduce the number of false positives without losing already selected true positives. We propose a novel consistent variable selection algorithm based on this property and show that it can outperform other variable selection methods on synthetic datasets drawn from linear and partially linear models, as well as on datasets from the UCI machine learning repository.
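The paper's exact algorithm is not reproduced on this page; the following is only a minimal sketch of the idea the abstract describes, using scikit-learn's `Lasso`. The function name `intersection_lasso` and the parameters `n_reps`, `subsample_frac`, and `alpha` are illustrative assumptions, not the authors' API.

```python
import numpy as np
from sklearn.linear_model import Lasso

def intersection_lasso(X, y, n_reps=10, subsample_frac=0.9, alpha=0.1, seed=0):
    """Fit the Lasso on several slightly smaller replications of the data,
    each sampled without replacement, and intersect the selected supports.
    Spurious predictors picked up by only a few subsamples are pruned."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = int(subsample_frac * n)
    support = None
    for _ in range(n_reps):
        idx = rng.choice(n, size=m, replace=False)   # subsample w/o replacement
        coef = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
        selected = set(np.flatnonzero(coef))         # nonzero coefficients
        support = selected if support is None else support & selected
    return sorted(support)

# Synthetic partially linear model: y is linear in features 0 and 1,
# non-linear in feature 2, with additive Gaussian noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
y = 3 * X[:, 0] - 2 * X[:, 1] + np.sin(X[:, 2]) + 0.1 * rng.normal(size=200)
print(intersection_lasso(X, y))
```

On this toy example the strong linear predictors (features 0 and 1) survive the intersection, while variables that only spuriously explain the non-linear variation in a few subsamples tend to be dropped.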
