Optimal instance selection for improved decision tree.

Abstract

Instance selection plays an important role in improving the scalability of data mining algorithms, but it can also be used to improve the quality of the data mining results. In this dissertation we present a new optimization-based approach to instance selection that uses a genetic algorithm (GA) to select a subset of instances that produces a simpler decision tree with acceptable accuracy. The resulting trees are likely to be easier for the decision maker to comprehend and interpret, and hence more useful in practice. We present numerical results for several difficult test datasets, which indicate that GA-based instance selection can often reduce the size of the decision tree by an order of magnitude while still maintaining good prediction accuracy. The results suggest that GA-based instance selection works best for low-entropy datasets; with higher entropy, instance selection yields less benefit. A comparison between the GA and other heuristic approaches, such as random mutation hill climbing (RMHC) and a simple construction heuristic, indicates that the GA is able to obtain a good solution at low computational cost even for some large datasets. One advantage of instance selection is that it increases the average number of instances associated with the leaves of the decision tree, which helps avoid overfitting; instance selection can therefore be used as an effective alternative to pruning decision trees. Finally, analysis of the selected instances reveals that instance selection helps to reduce outliers, reduce missing values, and select the most useful instances for separating classes.
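The abstract does not spell out the GA encoding or the fitness function, so the following is only a minimal sketch of the general idea, not the dissertation's method. It assumes a 0/1 mask over training instances as the chromosome, a fitness of validation accuracy minus a per-leaf size penalty (`size_weight` is an invented parameter), and a small Gini-based tree learner standing in for whatever tree algorithm the author used. All function names, including the synthetic-data helper `make_noisy_data`, are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(X, y):
    """Grow an unpruned binary tree; a leaf is a label, a node is (feature, threshold, left, right)."""
    majority = Counter(y).most_common(1)[0][0]
    if len(set(y)) == 1:
        return majority
    best = None  # (weighted impurity, feature index, threshold)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for t in values[:-1]:  # split "<= t" vs "> t"; both sides are non-empty
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # identical feature vectors with mixed labels
        return majority
    _, f, t = best
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left]),
            build_tree([X[i] for i in right], [y[i] for i in right]))


def predict(tree, row):
    """Follow splits until a leaf (a non-tuple label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def n_leaves(tree):
    """Tree size measured as the number of leaves."""
    if not isinstance(tree, tuple):
        return 1
    return n_leaves(tree[2]) + n_leaves(tree[3])


def fitness(mask, X, y, X_val, y_val, size_weight=0.01):
    """Assumed fitness: validation accuracy minus a penalty per leaf of the induced tree."""
    Xs = [row for row, keep in zip(X, mask) if keep]
    ys = [lab for lab, keep in zip(y, mask) if keep]
    if not ys:
        return float("-inf")  # an empty selection cannot train a tree
    tree = build_tree(Xs, ys)
    acc = sum(predict(tree, r) == lab for r, lab in zip(X_val, y_val)) / len(y_val)
    return acc - size_weight * n_leaves(tree)


def select_instances(X, y, X_val, y_val, pop_size=20, generations=30, p_mut=0.02, seed=0):
    """Generational GA over 0/1 instance masks with elitism, tournaments, one-point crossover."""
    rng = random.Random(seed)
    n = len(X)
    cache = {}

    def score(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = fitness(mask, X, y, X_val, y_val)
        return cache[key]

    # Include the full training set so the result is never worse than no selection.
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        elite = pop[:2]  # the two best masks survive unchanged
        children = []
        while len(children) < pop_size - len(elite):
            a, b = (max(rng.sample(pop, 3), key=score) for _ in range(2))
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]
            children.append([bit ^ (rng.random() < p_mut) for bit in child])
        pop = elite + children
    return max(pop, key=score)


def make_noisy_data(n, seed, noise=0.15):
    """Hypothetical synthetic data: class is x0 > 0.5, with a fraction of labels flipped."""
    rng = random.Random(seed)
    X = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(n)]
    y = [int(row[0] > 0.5) ^ (rng.random() < noise) for row in X]
    return X, y
```

Calling `select_instances(X, y, X_val, y_val)` returns a 0/1 list marking which training instances to keep. Because the all-instances mask is seeded into the initial population and the two best masks are carried over each generation, the returned mask never scores below the full training set under this assumed fitness. The validation split and the size penalty are design choices of this sketch; a different trade-off between accuracy and tree size would change which instances are selected.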

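Bibliographic record

  • Author

    Wu, Shuning

  • Author affiliation

    Iowa State University

  • Degree-granting institution: Iowa State University
  • Subject: Engineering, Industrial
  • Degree: Ph.D.
  • Year: 2007
  • Pages: 144 p.
  • Total pages: 144
  • Original format: PDF
  • Language: English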
