On the consistency of ensemble classification algorithms.

Abstract

The fields of machine learning and statistics have experienced rapid growth over the past 25 years. A great deal of theoretical and empirical research has been done in the area of classification, and many new algorithms have been proposed. Not all of them survived the scrutiny of effectiveness tests, but some were shown to be very effective and are widely used in practice even though their theoretical foundations remain unclear.

In this work we study two of the most successful classification and regression algorithms introduced in the past 15 years: AdaBoost and Random Forests. In both cases we are interested in the consistency of these algorithms in the classification setting.

In the case of the AdaBoost algorithm we investigate the risk, or probability of error, of the classifier produced by the algorithm. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that, provided AdaBoost is stopped after n^{1-ε} iterations (for sample size n and ε ∈ (0, 1)), the sequence of risks of the classifiers it produces approaches the Bayes risk. This answers the long-standing question of whether the AdaBoost algorithm in its original formulation is consistent.

For the Random Forests algorithm the goal was likewise to establish consistency or show the lack of it, since there seems to be some confusion about the algorithm: Breiman's words [17], "the generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large", are interpreted, at least sometimes, as a statement about the consistency of the algorithm. We consider a classification setting and give a simple proof that in the one-dimensional case Random Forests, as formulated in [17] and implemented in the randomForests R package, is nothing but a 1-nearest-neighbour classifier, and hence is not consistent. The proof also reveals that the algorithm can be made consistent by choosing a bootstrap sample size sublinear in the training sample size. A simulation study suggests that the same might be true in higher dimensions.
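To make the two quantitative points above concrete, here is a minimal sketch in Python. It uses scikit-learn's AdaBoostClassifier and RandomForestClassifier as stand-ins for the implementations the thesis actually analyses (the original AdaBoost formulation and the randomForests R package); the synthetic data and the choice ε = 0.5 are illustrative assumptions, not the author's experimental setup.

    # Illustrative sketch only: scikit-learn stands in for the AdaBoost and
    # Random Forests implementations studied in the thesis.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
    n = len(y)
    eps = 0.5  # any eps in (0, 1) satisfies the stopping rule

    # AdaBoost stopped after n^(1-eps) iterations: the stopping rule shown
    # to make the risk of the resulting classifiers approach the Bayes risk.
    t_n = max(1, int(n ** (1 - eps)))
    ada = AdaBoostClassifier(n_estimators=t_n, random_state=0).fit(X, y)

    # Random forest with bootstrap sample size sublinear in n. With the
    # default full-size bootstrap, the one-dimensional forest reduces to a
    # 1-nearest-neighbour classifier and is not consistent.
    m_n = max(1, int(n ** (1 - eps)))
    rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                max_samples=m_n, random_state=0).fit(X, y)
    print(f"AdaBoost iterations t_n = {t_n}, bootstrap size m_n = {m_n}")

With n = 1000 and ε = 0.5, both the AdaBoost iteration count and the bootstrap subsample size come out to roughly n^{1/2} ≈ 31, i.e. sublinear in the training sample size.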

Record details

  • Author: Traskin, Mikhail Petrovich
  • Author affiliation: University of California, Berkeley
  • Degree-granting institution: University of California, Berkeley
  • Subject: Statistics
  • Degree: Ph.D.
  • Year: 2007
  • Pages: 98 p.
  • Total pages: 98
  • Original format: PDF
  • Language: eng
  • Chinese Library Classification:
  • Keywords:
