Journal: Nucleic Acids Research

ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time

Abstract

Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed using a pseudometric, then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.
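
The abstract does not spell out how a cluster is encoded as a probabilistic sequence or how distances between such sequences are computed. As a rough illustration of the idea only, the sketch below assumes the sequences are already aligned to equal length, stores each cluster as per-column symbol frequencies, and uses the expected mismatch fraction as a distance. The names `profile`, `merge` and `profile_distance` are hypothetical, and this is not the ESPRIT-Tree implementation, which aligns the probabilistic sequences themselves.

```python
from collections import Counter

# Assumed alphabet: four nucleotides plus a gap symbol. Other symbols
# (for example N) are ignored by this sketch.
ALPHABET = "ACGT-"


def profile(aligned_seqs):
    """Summarize pre-aligned, equal-length sequences as column-wise frequencies."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be pre-aligned to equal length")
    n = len(aligned_seqs)
    columns = []
    for i in range(length):
        counts = Counter(s[i] for s in aligned_seqs)
        columns.append({a: counts.get(a, 0) / n for a in ALPHABET})
    return columns


def merge(p, n_p, q, n_q):
    """Combine two cluster profiles, weighting each by its cluster size."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    total = n_p + n_q
    return [
        {a: (cp[a] * n_p + cq[a] * n_q) / total for a in ALPHABET}
        for cp, cq in zip(p, q)
    ]


def profile_distance(p, q):
    """Expected fraction of mismatching columns between two profiles.

    For each column, the chance that a symbol drawn from p matches a symbol
    drawn from q is sum_a p[a] * q[a]; one minus that is the expected mismatch.
    """
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    mismatches = (
        1.0 - sum(cp[a] * cq[a] for a in ALPHABET) for cp, cq in zip(p, q)
    )
    return sum(mismatches) / len(p)


if __name__ == "__main__":
    cluster = profile(["ACGT", "ACGA"])
    single = profile(["ACGT"])
    print(profile_distance(cluster, single))
```

Comparing two clusters this way costs one pass over the columns, whatever the cluster sizes, which is the kind of saving the abstract attributes to avoiding exhaustive pairwise distance computation.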