kClust: fast and sensitive clustering of large protein sequence databases

Maria Hauser; Christian E Mayer; Johannes Söding


Abstract

Background: Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable.

Results: Here we present a method to cluster large protein sequence databases such as UniProt within days down to 20%–30% maximum pairwise sequence identity. kClust owes its speed and sensitivity to an alignment-free prefilter that calculates the cumulative score of all similar 6-mers between pairs of sequences, and to a dynamic programming algorithm that operates on pairs of similar 4-mers. To increase sensitivity further, kClust can run in profile-sequence comparison mode, with profiles computed from the clusters of a previous kClust iteration. kClust is two to three orders of magnitude faster than clustering based on NCBI BLAST, and on multidomain sequences of 20%–30% maximum pairwise sequence identity it achieves comparable sensitivity and a lower false discovery rate. It also compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed.

Conclusions: kClust fills the need for a fast, sensitive, and accurate tool to cluster large protein sequence databases to below 30% sequence identity. kClust is freely available under GPL at http://toolkit.lmb.uni-muenchen.de/pub/kClust/
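
The idea of an alignment-free k-mer prefilter can be illustrated with a minimal Python sketch. This is not kClust's implementation. The function names, the toy match/mismatch scoring (standing in for a real amino-acid substitution matrix), and the threshold `min_kmer_score` are all assumptions made for illustration. The sketch compares every 6-mer of one sequence with every 6-mer of the other by brute force and sums the scores of those pairs that count as "similar".

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_score(a, b, match=4, mismatch=-1):
    """Toy similarity score for two equal-length k-mers.

    Assumption: a flat match/mismatch score replaces a real
    substitution matrix to keep the example self-contained.
    """
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def prefilter_score(seq1, seq2, k=6, min_kmer_score=14):
    """Cumulative score of all 'similar' k-mer pairs between two sequences.

    A k-mer pair counts as similar if its toy score reaches
    min_kmer_score (a hypothetical threshold; with the defaults it
    tolerates up to two mismatches in a 6-mer).
    """
    total = 0
    for a in kmers(seq1, k):
        for b in kmers(seq2, k):
            s = kmer_score(a, b)
            if s >= min_kmer_score:
                total += s
    return total


if __name__ == "__main__":
    print(prefilter_score("ACDEFGHIK", "ACDEFGHIK"))
    print(prefilter_score("ACDEFGHIK", "WWWWWWWWW"))
```

Sequence pairs with a high cumulative score would be passed on to a slower alignment step, and the rest discarded. The nested loop is only meant to show the scoring idea. A practical tool would need a faster way to find similar k-mers than comparing all pairs.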

Bibliographic record

Source: BMC Bioinformatics, 2013, Issue 1
Authors: Maria Hauser; Christian E Mayer; Johannes Söding
Original format: PDF
