
Data complexity in machine learning and novel classification algorithms

Abstract

This thesis summarizes four of my research projects in machine learning. One of them is on a theoretical challenge of defining and exploring complexity measures for data sets; the others are about new and improved classification algorithms.

We first investigate the role of data complexity in the context of binary classification problems. The universal data complexity is defined for a data set as the Kolmogorov complexity of the mapping enforced by that data set. It is closely related to several existing principles used in machine learning, such as Occam's razor, the minimum description length, and the Bayesian approach. We demonstrate the application of the data complexity in two learning problems, data decomposition and data pruning. In data decomposition, we illustrate that a data set is best approximated by its principal subsets, which are Pareto optimal with respect to the complexity and the set size. In data pruning, we show that outliers usually have high complexity contributions, and propose methods for estimating the complexity contribution. Experiments were carried out with a practical complexity measure on several toy problems.

We then propose a family of novel learning algorithms to directly minimize the 0/1 loss for perceptrons. A perceptron is a linear threshold classifier that separates examples with a hyperplane. Unlike most perceptron learning algorithms, which require smooth cost functions, our algorithms directly minimize the 0/1 loss, and usually achieve the lowest training error compared with other algorithms. The algorithms are also computationally efficient. Such advantages make them favorable for both standalone use and ensemble learning, on problems that are not linearly separable. Experiments show that our algorithms work very well with AdaBoost.

We also study ensemble methods that aggregate many base hypotheses in order to achieve better performance. AdaBoost is one such method for binary classification problems. The superior out-of-sample performance of AdaBoost has been attributed to the fact that it minimizes a cost function based on the margin, in that it can be viewed as a special case of AnyBoost, an abstract gradient descent algorithm. We provide a more sophisticated abstract boosting algorithm, CGBoost, based on conjugate gradient in function space. When the AdaBoost exponential cost function is optimized, CGBoost generally yields much lower cost and training error but higher test error, which implies that the exponential cost is vulnerable to overfitting. With the optimization power of CGBoost, we can adopt more "regularized" cost functions that have better out-of-sample performance but are difficult to optimize. Our experiments demonstrate that CGBoost generally outperforms AnyBoost in cost reduction. With suitable cost functions, CGBoost can have better out-of-sample performance.

A multiclass classification problem can be reduced to a collection of binary problems with the aid of a coding matrix. The quality of the final solution, which is an ensemble of base classifiers learned on the binary problems, is affected by both the performance of the base learner and the error-correcting ability of the coding matrix. A coding matrix with strong error-correcting ability may not be overall optimal if the binary problems are too hard for the base learner. Thus a trade-off between error correction and base learning should be sought. We therefore propose a new multiclass boosting algorithm that modifies the coding matrix according to the learning ability of the base learner. We show experimentally that our algorithm is very efficient in optimizing the multiclass margin cost, and outperforms existing multiclass algorithms such as AdaBoost.ECC and one-vs-one. The improvement is especially significant when the base learner is not very powerful.
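As a small, self-contained illustration of the objective in the perceptron paragraph above (not the thesis's algorithms, which are far more efficient), the sketch below evaluates the 0/1 loss of a linear threshold classifier and runs a naive random search over hyperplanes; the function names and the search strategy are ours.

```python
import numpy as np

def zero_one_loss(w, b, X, y):
    """Fraction of examples misclassified by the hyperplane sign(w.x + b).
    X: (n, d) array of examples, y: (n,) array of +1/-1 labels."""
    predictions = np.sign(X @ w + b)
    predictions[predictions == 0] = 1          # break ties toward +1
    return np.mean(predictions != y)

def random_search_perceptron(X, y, n_trials=1000, seed=0):
    """Naive baseline: sample random hyperplanes and keep the one with the
    lowest 0/1 training error.  Purely illustrative -- the thesis proposes
    much better algorithms for the same objective."""
    rng = np.random.default_rng(seed)
    best_w, best_b, best_err = None, None, 1.0
    for _ in range(n_trials):
        w = rng.normal(size=X.shape[1])
        b = rng.normal()
        err = zero_one_loss(w, b, X, y)
        if err < best_err:
            best_w, best_b, best_err = w, b, err
    return best_w, best_b, best_err

# toy usage on a problem that is not linearly separable
X = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([-1, -1, 1, 1])                   # XOR-like labels
w, b, err = random_search_perceptron(X, y)
print(f"best 0/1 training error found: {err:.2f}")
```

The point of the sketch is only that the 0/1 loss is piecewise constant in (w, b): this is why most perceptron training methods fall back on a smooth surrogate cost, and why minimizing the 0/1 loss directly requires different algorithms.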
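To make the margin-based cost in the boosting paragraph concrete: for an ensemble f(x) = Σ_t α_t h_t(x) and labels y_n ∈ {−1, +1}, AdaBoost's exponential cost sums a decreasing function of the margins. The notation below is standard but chosen by us, not quoted from the thesis.

```latex
% Margin of training example n under the ensemble f:
%   \rho_n = y_n f(x_n),  where  f(x) = \sum_t \alpha_t h_t(x).
% AdaBoost's exponential cost, which AnyBoost (and CGBoost) descend in function space:
\[
  C(f) \;=\; \sum_{n=1}^{N} e^{-\rho_n}
        \;=\; \sum_{n=1}^{N} \exp\bigl(-y_n f(x_n)\bigr).
\]
```

Because this cost keeps rewarding ever-larger margins, driving it down aggressively is exactly the overfitting risk the abstract attributes to the exponential cost, and the reason CGBoost is paired with more regularized cost functions.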

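To make the coding-matrix reduction in the last paragraph concrete, here is a hedged sketch of the generic ECOC scheme it builds on: each column of the matrix defines one binary problem, and a test point is decoded to the class whose code row it matches best. This is the plain reduction with a scikit-learn decision tree standing in as a weak base learner; it is not AdaBoost.ECC or the adaptive algorithm proposed in the thesis, and the helper names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # stand-in weak base learner

def train_ecoc(X, y, coding_matrix):
    """Train one binary classifier per column of the coding matrix.
    coding_matrix[k, j] in {+1, -1} relabels class k for binary problem j."""
    learners = []
    for j in range(coding_matrix.shape[1]):
        binary_labels = coding_matrix[y, j]        # map each class label to +/-1
        clf = DecisionTreeClassifier(max_depth=2)  # deliberately weak learner
        clf.fit(X, binary_labels)
        learners.append(clf)
    return learners

def predict_ecoc(X, learners, coding_matrix):
    """Decode by Hamming distance between predicted bits and each class's code row."""
    bits = np.column_stack([clf.predict(X) for clf in learners])          # (n, columns)
    distances = (bits[:, None, :] != coding_matrix[None, :, :]).sum(axis=2)
    return distances.argmin(axis=1)

# toy usage: 3 classes with a one-vs-all style coding matrix
coding_matrix = np.array([[+1, -1, -1],
                          [-1, +1, -1],
                          [-1, -1, +1]])
X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y = np.array([0, 0, 1, 1, 2, 2])
learners = train_ecoc(X, y, coding_matrix)
print(predict_ecoc(X, learners, coding_matrix))    # ideally recovers [0 0 1 1 2 2]
```

The trade-off the thesis studies shows up directly here: a coding matrix with more error-correcting columns tolerates more mistakes at decoding time, but each added column may define a binary problem that is too hard for a weak base learner.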
Bibliographic details

  • Author: Li Ling
  • Author affiliation
  • Year: 2006
  • Total pages
  • Format: PDF
  • Language: English
  • CLC classification

