
Parametric classification and variable selection by the minimum integrated squared error criterion


Abstract

This thesis presents a robust solution to the classification and variable selection problem when the dimension of the data, or number of predictor variables, may greatly exceed the number of observations. When classifying objects from many measured attributes, the goal is to build a model that makes the most accurate predictions using only the most meaningful subset of the available measurements. The introduction of ℓ1-regularized model fitting has inspired many approaches that perform model fitting and variable selection simultaneously. If parametric models are employed, the standard approach is some form of regularized maximum likelihood estimation. While this is an asymptotically efficient procedure under very general conditions, it is not robust. Outliers can negatively impact both estimation and variable selection. Moreover, outliers can be very difficult to identify as the number of predictor variables becomes large. Minimizing the integrated squared error, or L2 error, while less efficient, has been shown to generate parametric estimators that are robust to a fair amount of contamination in several contexts. In this thesis, we present a novel robust parametric regression model for the binary classification problem based on L2 distance, the logistic L2 estimator (L2E). To perform simultaneous model fitting and variable selection among correlated predictors in the high dimensional setting, an elastic net penalty is introduced. A fast computational algorithm for minimizing the elastic net penalized logistic L2E loss is derived, and results on the algorithm's global convergence properties are given. Through simulations we demonstrate the utility of the penalized logistic L2E at robustly recovering sparse models from high dimensional data in the presence of outliers and inliers. Results on real genomic data are also presented.
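For a concrete picture of the criterion being minimized, the sketch below illustrates an L2E-type loss for a Bernoulli (logistic) model with an elastic net penalty. It follows the general minimum integrated squared error recipe, where for a discrete model the per-observation criterion is Σ_y p_θ(y|x)² − 2·p_θ(y_obs|x). The function names, the exact placement of the penalty, and the demo optimizer are illustrative assumptions, not the thesis's verbatim formulation or its specialized algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def logistic_l2e_loss(beta, X, y, lam=0.0, alpha=0.5):
    """Elastic-net-penalized L2E-type loss for a Bernoulli (logistic) model.

    Sketch only: the per-observation L2E term for a discrete model is
    sum_y p(y|x)^2 - 2*p(y_obs|x). The thesis derives its own fast
    minimization algorithm with global convergence results; none of
    that machinery is reproduced here.
    """
    p = 1.0 / (1.0 + np.exp(-X @ beta))        # P(y = 1 | x)
    sum_sq = p**2 + (1.0 - p)**2               # "integral" of the squared pmf
    p_obs = np.where(y == 1, p, 1.0 - p)       # mass on the observed label
    l2e = np.mean(sum_sq - 2.0 * p_obs)
    # Elastic net penalty on the non-intercept coefficients.
    penalty = lam * (alpha * np.abs(beta[1:]).sum()
                     + 0.5 * (1.0 - alpha) * (beta[1:] ** 2).sum())
    return l2e + penalty

# Tiny demo on synthetic data. Nelder-Mead is used because the l1 term
# is nonsmooth; it is adequate for a handful of coefficients, whereas
# the thesis's algorithm targets the high dimensional case.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])
beta_true = np.array([0.0, 2.0, -2.0, 0.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(int)
fit = minimize(logistic_l2e_loss, np.zeros(4), args=(X, y, 0.05, 0.5),
               method="Nelder-Mead")
print(fit.x)
```

One property visible in this form is that the per-observation loss is bounded (it ranges over [−1, 1], reaching −1 when the model puts all its mass on the observed label), which limits the leverage any single outlier can exert, in contrast to the unbounded negative log-likelihood.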

Bibliographic details

  • Author: Chi, Eric C.
  • Affiliation:
  • Year: 2012
  • Pages:
  • Format: PDF
  • Language: eng
  • CLC classification:

