
Data complexity in machine learning and novel classification algorithms

Abstract

This thesis summarizes four of my research projects in machine learning. One of them is on a theoretical challenge of defining and exploring complexity measures for data sets; the others are about new and improved classification algorithms.

We first investigate the role of data complexity in the context of binary classification problems. The universal data complexity is defined for a data set as the Kolmogorov complexity of the mapping enforced by that data set. It is closely related to several existing principles used in machine learning, such as Occam's razor, the minimum description length, and the Bayesian approach. We demonstrate the application of the data complexity in two learning problems, data decomposition and data pruning. In data decomposition, we illustrate that a data set is best approximated by its principal subsets, which are Pareto optimal with respect to the complexity and the set size. In data pruning, we show that outliers usually have high complexity contributions, and propose methods for estimating the complexity contribution. Experiments were carried out with a practical complexity measure on several toy problems.

We then propose a family of novel learning algorithms to directly minimize the 0/1 loss for perceptrons. A perceptron is a linear threshold classifier that separates examples with a hyperplane. Unlike most perceptron learning algorithms, which require smooth cost functions, our algorithms directly minimize the 0/1 loss, and usually achieve the lowest training error compared with other algorithms. The algorithms are also computationally efficient. Such advantages make them favorable for both standalone use and ensemble learning, on problems that are not linearly separable. Experiments show that our algorithms work very well with AdaBoost.

We also study ensemble methods that aggregate many base hypotheses in order to achieve better performance. AdaBoost is one such method for binary classification problems. The superior out-of-sample performance of AdaBoost has been attributed to the fact that it minimizes a cost function based on the margin, in that it can be viewed as a special case of AnyBoost, an abstract gradient descent algorithm. We provide a more sophisticated abstract boosting algorithm, CGBoost, based on conjugate gradient in function space. When the AdaBoost exponential cost function is optimized, CGBoost generally yields much lower cost and training error but higher test error, which implies that the exponential cost is vulnerable to overfitting. With the optimization power of CGBoost, we can adopt more "regularized" cost functions that have better out-of-sample performance but are difficult to optimize. Our experiments demonstrate that CGBoost generally outperforms AnyBoost in cost reduction. With suitable cost functions, CGBoost can have better out-of-sample performance.

A multiclass classification problem can be reduced to a collection of binary problems with the aid of a coding matrix. The quality of the final solution, which is an ensemble of base classifiers learned on the binary problems, is affected by both the performance of the base learner and the error-correcting ability of the coding matrix. A coding matrix with strong error-correcting ability may not be overall optimal if the binary problems are too hard for the base learner. Thus a trade-off between error correction and base learning should be sought. We therefore propose a new multiclass boosting algorithm that modifies the coding matrix according to the learning ability of the base learner. We show experimentally that our algorithm is very efficient in optimizing the multiclass margin cost, and outperforms existing multiclass algorithms such as AdaBoost.ECC and one-vs-one. The improvement is especially significant when the base learner is not very powerful.
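As a small, self-contained illustration of the objective in the perceptron paragraph above (not the thesis's algorithms, which are far more efficient), the sketch below evaluates the 0/1 loss of a linear threshold classifier and runs a naive random search over hyperplanes; the function names and the search strategy are ours.

```python
import numpy as np

def zero_one_loss(w, b, X, y):
    """Fraction of examples misclassified by the hyperplane sign(w.x + b).
    X: (n, d) array of examples, y: (n,) array of +1/-1 labels."""
    predictions = np.sign(X @ w + b)
    predictions[predictions == 0] = 1          # break ties toward +1
    return np.mean(predictions != y)

def random_search_perceptron(X, y, n_trials=1000, seed=0):
    """Naive baseline: sample random hyperplanes and keep the one with the
    lowest 0/1 training error.  Purely illustrative -- the thesis proposes
    much better algorithms for the same objective."""
    rng = np.random.default_rng(seed)
    best_w, best_b, best_err = None, None, 1.0
    for _ in range(n_trials):
        w = rng.normal(size=X.shape[1])
        b = rng.normal()
        err = zero_one_loss(w, b, X, y)
        if err < best_err:
            best_w, best_b, best_err = w, b, err
    return best_w, best_b, best_err

# toy usage on a problem that is not linearly separable
X = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([-1, -1, 1, 1])                   # XOR-like labels
w, b, err = random_search_perceptron(X, y)
print(f"best 0/1 training error found: {err:.2f}")
```

The point of the sketch is only that the 0/1 loss is piecewise constant in (w, b): this is why most perceptron training methods fall back on a smooth surrogate cost, and why minimizing the 0/1 loss directly requires different algorithms.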
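To make the margin-based cost in the boosting paragraph concrete: for an ensemble f(x) = Σ_t α_t h_t(x) and labels y_n ∈ {−1, +1}, AdaBoost's exponential cost sums a decreasing function of the margins. The notation below is standard but chosen by us, not quoted from the thesis.

```latex
% Margin of training example n under the ensemble f:
%   \rho_n = y_n f(x_n),  where  f(x) = \sum_t \alpha_t h_t(x).
% AdaBoost's exponential cost, which AnyBoost (and CGBoost) descend in function space:
\[
  C(f) \;=\; \sum_{n=1}^{N} e^{-\rho_n}
        \;=\; \sum_{n=1}^{N} \exp\bigl(-y_n f(x_n)\bigr).
\]
```

Because this cost keeps rewarding ever-larger margins, driving it down aggressively is exactly the overfitting risk the abstract attributes to the exponential cost, and the reason CGBoost is paired with more regularized cost functions.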

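To make the coding-matrix reduction in the last paragraph concrete, here is a hedged sketch of the generic ECOC scheme it builds on: each column of the matrix defines one binary problem, and a test point is decoded to the class whose code row it matches best. This is the plain reduction with a scikit-learn decision tree standing in as a weak base learner; it is not AdaBoost.ECC or the adaptive algorithm proposed in the thesis, and the helper names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # stand-in weak base learner

def train_ecoc(X, y, coding_matrix):
    """Train one binary classifier per column of the coding matrix.
    coding_matrix[k, j] in {+1, -1} relabels class k for binary problem j."""
    learners = []
    for j in range(coding_matrix.shape[1]):
        binary_labels = coding_matrix[y, j]        # map each class label to +/-1
        clf = DecisionTreeClassifier(max_depth=2)  # deliberately weak learner
        clf.fit(X, binary_labels)
        learners.append(clf)
    return learners

def predict_ecoc(X, learners, coding_matrix):
    """Decode by Hamming distance between predicted bits and each class's code row."""
    bits = np.column_stack([clf.predict(X) for clf in learners])          # (n, columns)
    distances = (bits[:, None, :] != coding_matrix[None, :, :]).sum(axis=2)
    return distances.argmin(axis=1)

# toy usage: 3 classes with a one-vs-all style coding matrix
coding_matrix = np.array([[+1, -1, -1],
                          [-1, +1, -1],
                          [-1, -1, +1]])
X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y = np.array([0, 0, 1, 1, 2, 2])
learners = train_ecoc(X, y, coding_matrix)
print(predict_ecoc(X, learners, coding_matrix))    # ideally recovers [0 0 1 1 2 2]
```

The trade-off the thesis studies shows up directly here: a coding matrix with more error-correcting columns tolerates more mistakes at decoding time, but each added column may define a binary problem that is too hard for a weak base learner.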
Bibliographic details

  • Author: Li Ling
  • Author affiliation
  • Year: 2006
  • Total pages
  • Format: PDF
  • Language: English
  • CLC classification

