Evaluating associative classification algorithms for Big Data

Abstract

Background: Associative classification, a combination of two important and different fields (classification and association rule mining), aims at building accurate and interpretable classifiers by means of association rules. A major problem in this field is that existing proposals do not scale well when Big Data are considered. The aim of this work is therefore to propose adaptations of two well-known associative classification algorithms (CBA and CPAR) for different Big Data platforms (Spark and Flink).

Results: An experimental study was performed on 40 datasets (30 classical datasets and 10 Big Data datasets). The classical datasets were used to find which algorithms perform better sequentially. The Big Data datasets were used to demonstrate the scalability of the Big Data proposals. The results were analyzed by means of non-parametric tests. CBA-Spark and CBA-Flink obtained interpretable classifiers but were more time-consuming than CPAR-Spark and CPAR-Flink. The study demonstrated that the proposals were able to run on Big Data (file sizes up to 200 GBytes). The analysis of different quality metrics revealed no statistical difference between the two approaches. Finally, three different metrics (speed-up, scale-up and size-up) were analyzed to demonstrate that the proposals scale well on Big Data.

Conclusions: The experimental study revealed that sequential algorithms cannot be used on large quantities of data, so approaches such as CBA-Spark, CBA-Flink, CPAR-Spark or CPAR-Flink are required. CBA proved to be very useful when the main goal is to obtain highly interpretable results. When the runtime has to be minimized, however, CPAR should be used. No statistical difference could be found between the two proposals in terms of the quality of the results, except for the interpretability of the final classifiers, where CBA was statistically better than CPAR.
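The abstract names speed-up, scale-up and size-up without defining them. The sketch below uses the definitions commonly found in the parallel data mining literature; it is an assumption that the paper uses the same ones. The function names and the runtimes are hypothetical and are not taken from the paper.

```python
def speed_up(t_one_node: float, t_m_nodes: float) -> float:
    """Same dataset, m times more nodes: single-node runtime over m-node runtime.

    The ideal value is m (linear speed-up).
    """
    return t_one_node / t_m_nodes


def size_up(t_base_data: float, t_scaled_data: float) -> float:
    """Same nodes, n times more data: scaled-data runtime over base-data runtime.

    A value close to n (or below it) means runtime grows at most linearly.
    """
    return t_scaled_data / t_base_data


def scale_up(t_one_node_base_data: float, t_m_nodes_m_times_data: float) -> float:
    """m times more nodes AND m times more data: base runtime over scaled runtime.

    The ideal value is 1.0 (runtime stays constant as both grow together).
    """
    return t_one_node_base_data / t_m_nodes_m_times_data


if __name__ == "__main__":
    # Hypothetical runtimes in seconds, for illustration only.
    print(speed_up(1200.0, 400.0))   # 1 node vs. 4 nodes, same data
    print(size_up(400.0, 1000.0))    # 4 nodes, data vs. 2x data
    print(scale_up(1200.0, 1500.0))  # 1 node + data vs. 4 nodes + 4x data
```

Under these definitions, a speed-up close to the number of nodes and a scale-up close to 1.0 would both indicate that a distributed implementation scales well.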