Journal of Emerging Technologies in Web Intelligence

An Efficient Mining for Approximate Frequent Items in Protein Sequence Database


Abstract

The rapid growth in available protein, DNA and other biological sequences has made discovering meaningful patterns in sequences a major task in bioinformatics research. Mining protein sequence databases poses special challenges because several protein databases are non-relational, whereas most data mining and machine learning techniques assume that the input is a relational database. Existing sequence mining algorithms focus mainly on mining subsequences. However, a wide range of applications, such as DNA and protein motif mining, need an effective way to identify approximate frequent patterns. Existing approximate frequent pattern mining algorithms have limitations, such as a lack of knowledge for finding the patterns, poor scalability, and difficulty in adapting to other applications. In this paper, a Generalized Approximate Pattern Algorithm (GAPA) is proposed to efficiently mine approximate frequent patterns in a protein sequence database. Pearson's correlation coefficient is computed among the items of the protein sequence database to analyze the approximate frequent patterns. The performance of GAPA is analyzed and tested on a FASTA protein sequence database; the FASTA database files hold the protein translations of Ensembl gene predictions. GAPA is compared with existing methods such as the Approximate Frequent Itemsets (AFI) tree and Approximate Closed Frequent Itemsets (ACFIM) in terms of support, accuracy, memory usage and time consumption. The experimental results show that GAPA is scalable and outperforms the existing algorithms.
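The abstract says only that Pearson's correlation coefficient is computed among database items; it does not say how items are encoded. As a minimal sketch, assuming each item is a single amino-acid letter represented by its per-sequence occurrence count, the coefficient between two items could be computed as follows. The helper names `item_counts` and `pearson` and the toy sequences are hypothetical, and this is not the GAPA algorithm itself.

```python
from math import sqrt


def item_counts(sequences, item):
    """Count occurrences of one item (here: a residue letter) in each sequence.

    Assumption: an "item" is a single amino-acid character; the paper may
    define items differently.
    """
    return [seq.count(item) for seq in sequences]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant vector has no defined correlation; report 0.0 by convention.
        return 0.0
    return cov / sqrt(var_x * var_y)


# Toy protein fragments, invented for illustration only.
sequences = ["MKVLAA", "MKKLA", "GGSAV", "MKVLLK"]
r = pearson(item_counts(sequences, "M"), item_counts(sequences, "K"))
print(f"correlation between items M and K: {r:.3f}")
```

A coefficient near 1 would suggest that the two items tend to occur together across sequences, which is one way a correlation measure could guide the search for approximate frequent patterns.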
