Artificial Intelligence Review: An International Science and Engineering Journal

The number of classes as a source for instability of decision tree algorithms in high dimensional datasets


Abstract

For a long time, experimental studies have been performed in a large number of fields of AI, especially in machine learning. A careful evaluation of a specific machine learning algorithm is important, but often difficult to conduct in practice. On the other hand, simulation studies can provide insights into the behavior and performance of machine learning approaches much more readily than real-world datasets, where the target concept is normally unknown. In decision tree induction algorithms, an interesting source of instability that is sometimes neglected by researchers is the number of classes in the training set. This paper uses simulation to extend previous work by Leo Breiman on the properties of splitting criteria. Our simulation results show that the number of best-splits grows with the number of classes: exponentially for both the entropy and twoing criteria, and linearly for the gini criterion. Since more splits imply more alternative choices, decreasing the number of classes in high dimensional datasets (ranging from hundreds to thousands of attributes, as typically found in biomedical domains) can help lower the instability of decision trees. Another important contribution of this work is that, for fewer than five classes, balanced datasets are prone to provide more best-splits (thus increasing instability) than imbalanced ones, including the binary problems often addressed in machine learning; on the other hand, for five or more classes, balanced datasets can provide few best-splits.
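
For reference, the three criteria compared in the abstract can be made concrete. Below is a minimal Python sketch (our own illustration, not code from the paper; function names and the NumPy layout are assumptions) of Breiman's CART definitions: gini impurity 1 - sum_j p_j^2, entropy -sum_j p_j log p_j, and the twoing criterion (pL*pR/4) * (sum_j |p(j|tL) - p(j|tR)|)^2. Counting how many candidate splits tie for the maximal score yields the number of best-splits whose growth with the number of classes the paper studies.

import numpy as np

def class_probs(labels, n_classes):
    # Empirical class distribution p(j|t) at a node.
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()

def gini(p):
    # Gini impurity: 1 - sum_j p_j^2.
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Entropy impurity: -sum_j p_j log2 p_j, with 0 log 0 taken as 0.
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

def impurity_decrease(left, right, n_classes, criterion):
    # Goodness of a binary split: i(t) - pL*i(tL) - pR*i(tR),
    # where criterion is gini or entropy.
    parent = np.concatenate([left, right])
    p_l = len(left) / len(parent)
    p_r = 1.0 - p_l
    return (criterion(class_probs(parent, n_classes))
            - p_l * criterion(class_probs(left, n_classes))
            - p_r * criterion(class_probs(right, n_classes)))

def twoing(left, right, n_classes):
    # Twoing criterion: (pL*pR/4) * (sum_j |p(j|tL) - p(j|tR)|)^2.
    parent_len = len(left) + len(right)
    p_l = len(left) / parent_len
    p_r = 1.0 - p_l
    diff = np.abs(class_probs(left, n_classes)
                  - class_probs(right, n_classes)).sum()
    return p_l * p_r / 4.0 * diff ** 2

# Example: score one candidate split of a 6-class node. Enumerating all
# candidate splits and counting ties for the maximal score (within a small
# tolerance) gives the number of best-splits; more ties mean more arbitrary
# choices during induction and hence less stable trees.
left = np.array([0, 0, 1, 2, 2, 3])
right = np.array([3, 4, 4, 5, 5, 5])
print(impurity_decrease(left, right, 6, gini))
print(impurity_decrease(left, right, 6, entropy))
print(twoing(left, right, 6))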
