
SECOM: A novel hash seed and community detection based-approach for genome-scale protein domain identification



Abstract

With rapid advances in the development of DNA sequencing technologies, a plethora of high-throughput genome and proteome data from a diverse spectrum of organisms have been generated. The functional annotation and evolutionary history of proteins are usually inferred from domains predicted from the genome sequences. Traditional database-based domain prediction methods cannot identify novel domains, however, and alignment-based methods, which look for recurring segments in the proteome, are computationally demanding. Here, we propose a novel genome-wide domain prediction method, SECOM. Instead of conducting all-against-all sequence alignment, SECOM first indexes all the proteins in the genome by using a hash seed function. Local similarity can thus be detected and encoded into a graph structure, in which each node represents a protein sequence and each edge weight represents the shared hash seeds between the two nodes. SECOM then formulates the domain prediction problem as an overlapping community-finding problem in this graph. A backward graph percolation algorithm that efficiently identifies the domains is proposed. We tested SECOM on five recently sequenced genomes of aquatic animals. Our tests demonstrated that SECOM was able to identify most of the known domains identified by InterProScan. When compared with the alignment-based method, SECOM showed higher sensitivity in detecting putative novel domains, while it was also three orders of magnitude faster. For example, SECOM was able to predict a novel sponge-specific domain in nucleoside-triphosphatases (NTPases). Furthermore, SECOM discovered two novel domains, likely of bacterial origin, that are taxonomically restricted to sea anemone and hydra. SECOM is an open-source program and is available at http://sfb.kaust.edu.sa/Pages/Software.aspx. © 2012 Fan et al.
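The indexing and graph-building steps described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: contiguous k-mers stand in for SECOM's hash seed function (whose actual definition the abstract does not give), the function names `seed_index` and `similarity_graph` are hypothetical, and the toy sequences are made up. The community-finding step (backward graph percolation) is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def seed_index(proteins, k=3):
    """Map each k-mer to the set of proteins that contain it.

    Assumption: a contiguous k-mer is used as a simple stand-in for a
    hash seed; SECOM's real seed function may differ.
    """
    index = defaultdict(set)
    for name, seq in proteins.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def similarity_graph(index):
    """Build weighted edges: weight = number of distinct seeds two proteins share."""
    weights = defaultdict(int)
    for names in index.values():
        # Every pair of proteins sharing this seed gains one unit of weight.
        for a, b in combinations(sorted(names), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical toy proteome: P1 and P2 share the segment "VLAAG".
proteins = {
    "P1": "MKVLAAGIT",
    "P2": "GGVLAAGWW",
    "P3": "TTTTTTTTT",
}
graph = similarity_graph(seed_index(proteins, k=3))
print(graph)
```

Because each seed is looked up once in the index, pairs of proteins are only compared when they share at least one seed, which is the intuition behind avoiding all-against-all alignment. Domains would then correspond to overlapping communities in this weighted graph.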
