BOAT-Optimistic Decision Tree Construction

Abstract

Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. Decision trees are a very popular class of classifiers. All current algorithms for constructing decision trees, including all main-memory algorithms, make one scan over the training database per level of the tree. We introduce a new algorithm (BOAT) for decision tree construction that improves upon earlier algorithms in both performance and functionality. BOAT constructs several levels of the tree in only two scans over the training database, resulting in an average performance gain of 300% over previous work. The key to this performance improvement is a novel optimistic approach to tree construction in which we construct an initial tree using a small subset of the data and refine it to arrive at the final tree. We guarantee that any difference with respect to the "real" tree (i.e., the tree that would be constructed by examining all the data in a traditional way) is detected and corrected. The correction step occasionally requires us to make additional scans over subsets of the data; in practice, this situation rarely arises and can be handled at little added cost. Beyond offering faster tree construction, BOAT is the first scalable algorithm that can incrementally update the tree with respect to both insertions and deletions over the dataset. This property is valuable in dynamic environments such as data warehouses, in which the training dataset changes over time. The BOAT update operation is much cheaper than completely rebuilding the tree, and the resulting tree is guaranteed to be identical to the tree that would be produced by a complete rebuild.
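The optimistic build-then-verify idea can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions that the abstract does not state: a single node, one numeric attribute, two classes labeled 0 and 1, Gini impurity as the split criterion, bootstrap resampling of a small sample to bracket the split point, and an in-memory list standing in for the training database. The names `split_score`, `best_split`, and `optimistic_split` are hypothetical. One pass over the full data keeps only the records that fall inside the bracket and summarizes everything else as per-bucket class counts. Because weighted Gini is concave in the left-side class counts, its minimum over a bucket is attained at a corner of the bucket's count range, which yields a lower bound for every split point outside the bracket. If that bound cannot rule out a better split, the sketch falls back to an exact computation, so the returned threshold always matches the one found by examining all the data.

```python
import bisect
import itertools
import random


def split_score(left, total):
    """Weighted Gini impurity of a binary split, from per-class counts."""
    n = sum(total)
    score = 0.0
    for side in (left, [t - l for t, l in zip(total, left)]):
        m = sum(side)
        if m:
            score += (m / n) * (1.0 - sum((c / m) ** 2 for c in side))
    return score


def best_split(records, start=(0, 0), total=None):
    """Exact best threshold t (rule: value <= t) among `records`.

    `start` holds class counts already known to fall left of every record
    here; `total` holds the class counts of the whole node.
    Returns (score, t); t is None if no valid split exists.
    """
    data = sorted(records)
    if total is None:
        total = [sum(1 for _, y in data if y == c) for c in (0, 1)]
    left = list(start)
    best = (float("inf"), None)
    for i, (x, y) in enumerate(data):
        left[y] += 1
        last_of_value = i + 1 == len(data) or data[i + 1][0] != x
        if last_of_value and sum(left) < sum(total):
            best = min(best, (split_score(left, total), x))
    return best


def optimistic_split(records, sample_size=200, n_boot=20, seed=0):
    """Return (threshold, fell_back) for one node (illustrative sketch)."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))

    # Phase 1 (in memory): bootstrap the sample to bracket the split point.
    guesses = []
    for _ in range(n_boot):
        _, t = best_split([rng.choice(sample) for _ in sample])
        if t is not None:
            guesses.append(t)
    if not guesses:
        return best_split(records)[1], True
    lo, hi = min(guesses), max(guesses)

    # Phase 2 (one scan): keep in-bracket records, bucket all others.
    bounds = sorted({x for x, _ in sample})
    buckets = [[0, 0] for _ in range(len(bounds) + 1)]
    kept, total = [], [0, 0]
    for x, y in records:
        total[y] += 1
        if lo <= x <= hi:
            kept.append((x, y))
        else:
            buckets[bisect.bisect_left(bounds, x)][y] += 1

    cut = bisect.bisect_left(bounds, lo) + 1  # buckets[:cut] lie below lo
    below = [sum(b[c] for b in buckets[:cut]) for c in (0, 1)]
    score, t = best_split(kept, start=below, total=total)

    # Verification: lower-bound the score of any split outside [lo, hi].
    cum = [0, 0]
    for i, b in enumerate(buckets):
        if i == cut:  # step over the bracket: add the kept records
            cum = [below[c] + sum(1 for _, y in kept if y == c) for c in (0, 1)]
        nxt = [cum[0] + b[0], cum[1] + b[1]]
        if sum(b):
            bound = min(split_score(corner, total)
                        for corner in itertools.product(*zip(cum, nxt)))
            if bound <= score + 1e-9:  # cannot rule out a better split
                return best_split(records)[1], True
        cum = nxt
    return t, False


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic data: class 1 mostly above 60, with 5% label noise.
    data = []
    for _ in range(5000):
        x = rng.uniform(0, 100)
        y = int(x > 60)
        data.append((x, 1 - y if rng.random() < 0.05 else y))
    print(optimistic_split(data), best_split(data)[1])
```

The `fell_back` flag reports whether the optimistic estimate had to be corrected. In this sketch the correction is a full recomputation, which is a simplification; the abstract describes additional scans over subsets of the data instead.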
