Journal: BMC Bioinformatics

Fast batch searching for protein homology based on compression and clustering


Abstract

In the bioinformatics community, many tasks involve matching a set of protein query sequences against large sequence datasets. To run multiple queries against a database, a commonly used method is to run BLAST on each original query or on the concatenated queries. This is inefficient because it does not exploit the common subsequences shared by the queries. We propose a compression- and clustering-based BLASTP (C2-BLASTP) algorithm to further exploit the joint information among the query sequences and the database. First, the queries and the database are compressed in turn through the procedures of redundancy analysis, redundancy removal and distinction recording. Second, the database is clustered according to the Hamming distance among the subsequences. To improve the sensitivity and selectivity of sequence alignments, ten groups of reduced amino acid alphabets are used. Next, the hit-finding operator is applied to the clustered database. Furthermore, an execution database is constructed from the potential hits found, with the objective of mitigating the effect of the increasing scale of the sequence database. Finally, the homology search is performed in the execution database. Experiments on the NCBI NR database demonstrate the effectiveness of the proposed C2-BLASTP for batch homology searching in sequence databases. The results are evaluated in terms of homology accuracy, search speed and memory usage. C2-BLASTP achieves competitive results compared with some state-of-the-art methods.
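The compression and clustering steps described in the abstract can be illustrated with a minimal sketch. This is not the C2-BLASTP implementation. The abstract does not give the ten reduced alphabets, the subsequence length, or the clustering procedure, so the residue grouping in `GROUPS`, the word length `k`, the distance threshold `max_dist`, the greedy clustering strategy, and all function names below are assumptions chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical reduced alphabet: this grouping is illustrative only and is
# NOT one of the ten reduced alphabets used in the paper.
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
REDUCE: Dict[str, str] = {aa: group[0] for group in GROUPS for aa in group}


def reduce_sequence(seq: str) -> str:
    """Replace each residue by its group representative ('X' if unknown)."""
    return "".join(REDUCE.get(aa, "X") for aa in seq.upper())


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def compress(queries: Dict[str, str], k: int) -> Dict[str, List[Tuple[str, int]]]:
    """Store each distinct length-k subsequence once, recording where it occurs.

    This stands in for 'redundancy removal' plus a record of origins;
    the paper's actual compression scheme may differ.
    """
    index: Dict[str, List[Tuple[str, int]]] = {}
    for query_id, seq in queries.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], []).append((query_id, i))
    return index


def greedy_cluster(words: List[str], max_dist: int) -> Dict[str, List[str]]:
    """Assign each word to the first cluster whose centre is within max_dist.

    Distances are Hamming distances measured in the reduced alphabet.
    A word that matches no existing centre becomes a new centre.
    """
    clusters: Dict[str, List[str]] = {}
    for word in words:
        reduced = reduce_sequence(word)
        for centre in clusters:
            if hamming(reduce_sequence(centre), reduced) <= max_dist:
                clusters[centre].append(word)
                break
        else:
            clusters[word] = [word]
    return clusters


queries = {"q1": "AVLKD", "q2": "AVLKE"}
index = compress(queries, k=4)
clusters = greedy_cluster(list(index), max_dist=1)
```

In this toy input the two queries share the subsequence `AVLK`, so it is stored once with two recorded positions, and `VLKD` and `VLKE` fall into the same cluster because D and E share a group in the assumed alphabet.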
