Journal: Journal of Biomedicine & Biotechnology

Unsupervised Two-Way Clustering of Metagenomic Sequences



Abstract

A major challenge facing metagenomics is the development of tools for characterizing the functional and taxonomic content of vast numbers of short metagenome reads. The efficacy of clustering methods depends on the number of reads in the dataset, the read length, and the relative abundances of the source genomes in the microbial community. In this paper, we formulate an unsupervised naive Bayes multispecies, multidimensional mixture model for reads from a metagenome. We use the proposed model to cluster metagenomic reads by their species of origin and to characterize the abundance of each species. We model the distribution of word counts along a genome as a Gaussian for shorter, frequent words and as a Poisson for longer, rare words. We employ either a mixture of Gaussians or a mixture of Poissons to model reads within each bin. Further, we handle the high dimensionality and sparsity of the data by grouping the set of words comprising the reads, which results in a two-way mixture model. Finally, we demonstrate the accuracy and applicability of this method on simulated and real metagenomes. Our method can accurately cluster reads as short as 100 bp and is robust to varying abundances, divergences, and read lengths.
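The Poisson variant of the model described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes fixed-length words (k-mers), treats word counts as independent given the species (the naive Bayes assumption), and fits the mixture with plain EM. It omits the word-grouping step that makes the published model two-way. The function names `kmer_counts` and `poisson_mixture_em`, the initialisation scheme, and the smoothing constants are all hypothetical choices made for illustration.

```python
import math
import random
from itertools import product


def kmer_counts(read, k=2):
    """Count overlapping k-mers (words) in a read, in lexicographic order."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:  # skip words with ambiguous bases such as N
            counts[j] += 1
    return counts


def poisson_mixture_em(X, n_clusters, n_iter=50, seed=0):
    """Fit a naive Bayes mixture of Poissons to count vectors X with EM.

    Returns (labels, pi, lam): hard cluster labels, mixing proportions
    (estimated relative abundances), and per-cluster Poisson rates.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Assumed initialisation: rates taken from randomly chosen reads,
    # shifted by 0.5 so that log(rate) is always defined.
    centers = rng.sample(range(n), n_clusters)
    lam = [[X[c][j] + 0.5 for j in range(d)] for c in centers]
    pi = [1.0 / n_clusters] * n_clusters
    resp = []
    for _ in range(n_iter):
        # E-step: posterior responsibility of each cluster for each read.
        # The log(x!) term is the same for every cluster, so it is dropped.
        resp = []
        for x in X:
            logp = [
                math.log(pi[c])
                + sum(x[j] * math.log(lam[c][j]) - lam[c][j] for j in range(d))
                for c in range(n_clusters)
            ]
            m = max(logp)
            w = [math.exp(v - m) for v in logp]
            s = sum(w)
            resp.append([v / s for v in w])
        # M-step: update mixing proportions and rates (lightly smoothed).
        for c in range(n_clusters):
            nc = sum(r[c] for r in resp)
            pi[c] = max(nc / n, 1e-12)
            for j in range(d):
                weighted = sum(resp[i][c] * X[i][j] for i in range(n))
                lam[c][j] = (weighted + 1e-3) / (nc + 1e-3)
    labels = [max(range(n_clusters), key=lambda c: r[c]) for r in resp]
    return labels, pi, lam
```

In use, each read would be converted with `kmer_counts` and the resulting vectors passed to `poisson_mixture_em`. The returned `pi` then serves as a rough abundance estimate per cluster. For short, frequent words the abstract suggests a Gaussian in place of the Poisson, which would change only the per-word log-likelihood term in the E-step and the parameter updates in the M-step.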
