Conference: IEEE International Conference on Big Data

DLA: a Distributed, Location-based and Apriori-based Algorithm for Biological Sequence Pattern Mining


Abstract

With the rapid growth of genomic data, the need for scalable data mining algorithms has increased. Frequent contiguous sequence mining is a technique that can help biologists better understand the function and structure of our DNA by capturing the common characteristics among related sequences. Many sequence mining algorithms have been developed over time. However, most of them either suffer from scaling issues when dealing with big data or give no guarantee of the completeness of their results. In this paper, we propose a distributed sequential pattern mining algorithm implemented on Apache Spark. Specifically, the algorithm exploits the Apriori property and information about each pattern's location within the original sequence to drastically reduce the number of candidates at each iteration. Experimental results on real-world datasets confirm our performance expectations, showing better scalability than other distributed solutions.
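
The pruning idea described in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' Spark implementation: the function name `mine_frequent_contiguous`, the `min_support` parameter, and the definition of support as the number of distinct sequences containing a pattern are all assumptions made for illustration. The sketch grows patterns one symbol at a time, only at recorded locations of already-frequent patterns, and discards a candidate whose suffix is not frequent (the Apriori property).

```python
from collections import defaultdict


def mine_frequent_contiguous(sequences, min_support):
    """Return {pattern: support} for contiguous patterns found in at least
    min_support distinct sequences (hypothetical helper, illustration only)."""
    # Level 1: record every (sequence id, start offset) of each single symbol.
    locations = defaultdict(list)
    for sid, seq in enumerate(sequences):
        for pos, symbol in enumerate(seq):
            locations[symbol].append((sid, pos))

    result = {}
    length = 1  # length of the patterns currently held in `locations`
    while locations:
        # Keep only patterns whose occurrences span enough distinct sequences.
        frequent = {}
        for pattern, locs in locations.items():
            support = len({sid for sid, _ in locs})
            if support >= min_support:
                frequent[pattern] = locs
                result[pattern] = support

        # Location-based extension: look only at the symbol that follows a
        # known occurrence of a frequent pattern, never at the whole input.
        locations = defaultdict(list)
        for pattern, locs in frequent.items():
            for sid, pos in locs:
                end = pos + length
                if end < len(sequences[sid]):
                    candidate = pattern + sequences[sid][end]
                    # Apriori pruning: the suffix of a frequent pattern
                    # must itself be frequent at the previous level.
                    if candidate[1:] in frequent:
                        locations[candidate].append((sid, pos))
        length += 1
    return result


if __name__ == "__main__":
    toy = ["ACGTAC", "TACGA", "GGACG"]  # toy data, not from the paper
    print(mine_frequent_contiguous(toy, min_support=3))
```

In a distributed setting the per-pattern location lists would presumably be partitioned across workers, but how the paper does this is not stated in the abstract.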
