Home > Foreign Conference Proceedings > International Conference on Circuit, Power and Computing Technologies > Parallel frequent itemset mining with spark RDD framework for disease prediction
Parallel frequent itemset mining with spark RDD framework for disease prediction


Abstract

The aim of frequent itemset mining is to find all common sets of items, defined as those itemsets that occur with at least a minimum support. There are many well-known algorithms for frequent itemset mining, among them Apriori, Eclat, RElim, SaM, and FP-Growth. Although each of these algorithms is well formed and works in different scenarios, their main drawback is that they were designed to operate on small chunks of data. This limitation reflects the era in which they were developed, before the notion of big data had taken hold, so these algorithms do not perform well at the scale of today's datasets. We therefore propose a new approach that implements these well-known algorithms in a parallelized manner so that they can handle such data effectively. The proposed work parallelizes Faster-IAPI, a dynamic frequent itemset mining algorithm, with the Spark RDD framework. Apache Spark was selected because it overcomes a limitation of the Hadoop architecture, which was designed to handle big data processing in a parallelized manner but does not handle iterative algorithms well; Spark rectifies this drawback. In this approach the algorithm is applied to find correlations between different symptoms of patients in a faster and more efficient manner, and provides support for predicting the occurrence of a disease based on its symptoms.
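The abstract does not detail the Faster-IAPI algorithm itself, so as an illustration of the underlying idea — finding all itemsets whose support meets a minimum threshold, here over toy patient-symptom data — a minimal, non-parallel Apriori-style sketch might look like the following (the dataset and all names are hypothetical):

```python
from collections import Counter

def frequent_itemsets(transactions, min_support):
    """Apriori-style sketch: return every itemset whose support
    (fraction of transactions containing it) is >= min_support."""
    n = len(transactions)
    tx_sets = [frozenset(t) for t in transactions]

    # Frequent 1-itemsets seed the search.
    counts = Counter(item for t in tx_sets for item in t)
    current = {frozenset([i]) for i, c in counts.items() if c / n >= min_support}
    result = {s: counts[next(iter(s))] / n for s in current}

    k = 2
    while current:
        # Candidate k-itemsets: unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = set()
        for cand in candidates:
            support = sum(1 for t in tx_sets if cand <= t) / n
            if support >= min_support:
                current.add(cand)
                result[cand] = support
        k += 1
    return result

# Toy dataset: each transaction is one patient's set of symptoms.
patients = [
    {"fever", "cough", "fatigue"},
    {"fever", "cough"},
    {"cough", "fatigue"},
    {"fever", "cough", "headache"},
]
freq = frequent_itemsets(patients, min_support=0.5)
# {fever, cough} appears in 3 of 4 transactions, so its support is 0.75.
```

In a Spark RDD implementation such as the one the paper proposes, the per-candidate support counting would be distributed across partitions of the transaction set rather than computed in a single loop as above.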

