Journal: BMC Bioinformatics

kClust: fast and sensitive clustering of large protein sequence databases



Abstract

Background: Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable.

Results: Here we present a method to cluster large protein sequence databases such as UniProt within days down to 20%–30% maximum pairwise sequence identity. kClust owes its speed and sensitivity to an alignment-free prefilter that calculates the cumulative score of all similar 6-mers between pairs of sequences, and to a dynamic programming algorithm that operates on pairs of similar 4-mers. To increase sensitivity further, kClust can run in profile-sequence comparison mode, with profiles computed from the clusters of a previous kClust iteration. kClust is two to three orders of magnitude faster than clustering based on NCBI BLAST, and on multidomain sequences of 20%–30% maximum pairwise sequence identity it achieves comparable sensitivity and a lower false discovery rate. It also compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed.

Conclusions: kClust fills the need for a fast, sensitive, and accurate tool to cluster large protein sequence databases to below 30% sequence identity. kClust is freely available under GPL at http://toolkit.lmb.uni-muenchen.de/pub/kClust/.
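To illustrate the idea of an alignment-free k-mer prefilter, here is a minimal Python sketch. It is not kClust's implementation. As a simplifying assumption it counts only exact shared 6-mers (score 1 each), whereas the abstract describes a cumulative score over similar 6-mers. The function names `kmers`, `shared_kmer_scores`, and `prefilter`, and the `min_score` threshold, are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from itertools import combinations

K = 6  # k-mer length named in the abstract for the prefilter


def kmers(seq, k=K):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_scores(seqs, k=K):
    """Score each sequence pair by the number of distinct k-mers they share.

    Simplification (assumption): exact matches only, each worth 1.
    An inverted index (k-mer -> sequence names) means that only pairs
    with at least one common k-mer are ever scored.
    """
    index = defaultdict(set)
    for name, seq in seqs.items():
        for km in kmers(seq, k):
            index[km].add(name)

    scores = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            scores[(a, b)] += 1
    return dict(scores)


def prefilter(seqs, min_score, k=K):
    """Keep only the pairs whose shared k-mer score reaches min_score."""
    all_scores = shared_kmer_scores(seqs, k)
    return {pair: s for pair, s in all_scores.items() if s >= min_score}
```

Pairs that pass such a filter would then go on to a more expensive alignment step. The inverted index is one way to avoid scoring every pair when most sequences share no k-mer, although frequent k-mers can still produce many candidate pairs.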
