On the consistency of ensemble classification algorithms.

Abstract

The fields of machine learning and statistics have experienced rapid growth over the past 25 years. A great deal of theoretical and empirical research has been done in the area of classification, and many new algorithms have been proposed. Not all of them survived the scrutiny of effectiveness tests, but some were shown to be very effective and are widely used in practice even though their theoretical foundations remain unclear.

In this work we study two of the most successful classification and regression algorithms introduced in the past 15 years: AdaBoost and Random Forests. In both cases we are interested in the consistency of these algorithms in the classification setting.

In the case of the AdaBoost algorithm we investigate the risk, or probability of error, of the classifier produced by the algorithm. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that, provided AdaBoost is stopped after n^{1-ε} iterations (for sample size n and ε ∈ (0, 1)), the sequence of risks of the classifiers it produces approaches the Bayes risk. This answers the long-standing question of whether the AdaBoost algorithm in its original formulation is consistent.

For the Random Forests algorithm the goal was likewise to establish consistency or show the lack of it, since there seems to be some confusion about the algorithm: Breiman's words [17], "the generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large", are interpreted, at least sometimes, as a statement about the consistency of the algorithm. We consider a classification setting and give a simple proof that in the one-dimensional case Random Forests, as formulated in [17] and implemented in the randomForests R package, is nothing but a 1-nearest-neighbour classifier, and hence is not consistent. The proof also reveals that the algorithm can be made consistent by choosing a bootstrap sample size sublinear in the training sample size. A simulation study suggests that the same might be true in higher dimensions.
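To make the two quantitative points above concrete, here is a minimal sketch in Python. It uses scikit-learn's AdaBoostClassifier and RandomForestClassifier as stand-ins for the implementations the thesis actually analyses (the original AdaBoost formulation and the randomForests R package); the synthetic data and the choice ε = 0.5 are illustrative assumptions, not the author's experimental setup.

    # Illustrative sketch only: scikit-learn stands in for the AdaBoost and
    # Random Forests implementations studied in the thesis.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
    n = len(y)
    eps = 0.5  # any eps in (0, 1) satisfies the stopping rule

    # AdaBoost stopped after n^(1-eps) iterations: the stopping rule shown
    # to make the risk of the resulting classifiers approach the Bayes risk.
    t_n = max(1, int(n ** (1 - eps)))
    ada = AdaBoostClassifier(n_estimators=t_n, random_state=0).fit(X, y)

    # Random forest with bootstrap sample size sublinear in n. With the
    # default full-size bootstrap, the one-dimensional forest reduces to a
    # 1-nearest-neighbour classifier and is not consistent.
    m_n = max(1, int(n ** (1 - eps)))
    rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                max_samples=m_n, random_state=0).fit(X, y)
    print(f"AdaBoost iterations t_n = {t_n}, bootstrap size m_n = {m_n}")

With n = 1000 and ε = 0.5, both the AdaBoost iteration count and the bootstrap subsample size come out to roughly n^{1/2} ≈ 31, i.e. sublinear in the training sample size.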

Record details

  • Author: Traskin, Mikhail Petrovich
  • Author affiliation: University of California, Berkeley
  • Degree-granting institution: University of California, Berkeley
  • Subject: Statistics
  • Degree: Ph.D.
  • Year: 2007
  • Pages: 98 p.
  • Total pages: 98
  • Original format: PDF
  • Language: eng
  • Chinese Library Classification:
  • Keywords:
