Journal: International journal of applied mathematics and computer science

Interpretable decision-tree induction in a big data parallel framework


Abstract

When running data-mining algorithms on big data platforms, a parallel, distributed framework, such as MAPREDUCE, may be used. However, in a parallel framework, each individual model fits the data allocated to its own computing node without necessarily fitting the entire dataset. In order to induce a single consistent model, ensemble algorithms, such as majority voting, aggregate the local models rather than analyzing the entire dataset directly. Our goal is to develop an efficient algorithm for choosing one representative model from multiple, locally induced decision-tree models. The proposed SySM (syntactic similarity method) algorithm computes the similarity between the models produced by the parallel nodes and chooses the model that is most similar to the others as the best representative of the entire dataset. In 18.75% of 48 experiments on four big datasets, SySM accuracy is significantly higher than that of the ensemble; in about 43.75% of the experiments, SySM accuracy is significantly lower; in one case, the results are identical; and in the remaining 35.41% of cases, the difference is not statistically significant. Compared with ensemble methods, the representative tree models selected by the proposed methodology are more compact and interpretable, their induction consumes less memory, and, as confirmed by the empirical results, they allow faster classification of new records.
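The selection step described in the abstract (pick the local model that is most similar to the others) can be illustrated with a minimal sketch. The abstract does not define the syntactic similarity measure itself, so the sketch makes two assumptions that are not taken from the paper: each local tree is represented as a set of root-to-leaf rule strings, and similarity is the Jaccard index over those sets. The names `jaccard`, `most_representative`, and `local_trees` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets (assumed stand-in for the paper's similarity)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def most_representative(models, similarity=jaccard):
    """Return the index of the model with the highest total similarity to the others."""
    totals = [0.0] * len(models)
    # Each unordered pair is compared once; the score counts for both models.
    for i, j in combinations(range(len(models)), 2):
        score = similarity(models[i], models[j])
        totals[i] += score
        totals[j] += score
    return max(range(len(models)), key=totals.__getitem__)


# Hypothetical local trees, one per computing node, written as rule sets.
local_trees = [
    frozenset({"age<=30 & income=low -> no",
               "age<=30 & income=high -> yes",
               "age>30 -> yes"}),
    frozenset({"age<=30 & income=low -> no",
               "age<=30 & income=high -> yes",
               "age>30 -> no"}),
    frozenset({"age<=30 -> no",
               "age>30 -> yes"}),
]

best = most_representative(local_trees)
print(best)
```

In this toy input the first tree shares rules with both of the others, so it accumulates the highest total similarity and is selected. Only that single tree is then used for classification, which is consistent with the compactness and speed advantages the abstract reports relative to an ensemble.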
