Conference: IEEE Information Theory Workshop
The Metagenomic Binning Problem: Clustering Markov Sequences

Abstract

The goal of metagenomics is to study the composition of microbial communities, typically using high-throughput shotgun sequencing. In the metagenomic binning problem, we observe random substrings (called contigs) from a mixture of genomes and want to cluster them according to their genome of origin. Motivated by the empirical observation that genomes of different bacterial species can be distinguished by their tetranucleotide frequencies, we model this task as the problem of clustering N sequences generated by M distinct Markov processes, where M≪N. Using the large-deviation principle for Markov processes, we establish the information-theoretic limit for perfect binning. Specifically, we show that the length of the contigs must scale with the inverse of the Chernoff information between the two most similar species. Our result also implies that contigs should be binned using the conditional relative entropy as the measure of distance, rather than the Euclidean distance often used in practice.
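The abstract does not spell out an algorithm, so the following is only a minimal sketch of what binning by conditional relative entropy could look like. It assumes the candidate Markov models are already known and strictly positive, which is a simplification: in actual binning the species models are unknown. It also uses order k=1 for brevity, whereas tetranucleotide frequencies would correspond to k=3. The names `transition_counts`, `conditional_relative_entropy`, `assign_bin`, and `sticky_model` are hypothetical helpers introduced for illustration. The distance computed is D = Σ_s π̂(s) Σ_a P̂(a|s) log(P̂(a|s)/Q(a|s)), where P̂ and π̂ are the empirical transition and context frequencies of the contig and Q is a candidate model.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def transition_counts(seq, k=1):
    """Count (context, next symbol) pairs for an order-k Markov model."""
    counts = Counter()
    for i in range(len(seq) - k):
        counts[(seq[i:i + k], seq[i + k])] += 1
    return counts

def conditional_relative_entropy(counts, model):
    """Empirical conditional relative entropy (in nats) between a contig and a model.

    `model[ctx][sym]` is Q(sym | ctx); all entries are assumed to be positive.
    """
    total = sum(counts.values())
    context_totals = Counter()
    for (ctx, _), c in counts.items():
        context_totals[ctx] += c
    d = 0.0
    for (ctx, sym), c in counts.items():
        p = c / context_totals[ctx]          # empirical P(sym | ctx)
        d += (c / total) * math.log(p / model[ctx][sym])
    return d

def assign_bin(seq, models, k=1):
    """Return the index of the model closest to the contig in this distance."""
    counts = transition_counts(seq, k)
    return min(range(len(models)),
               key=lambda m: conditional_relative_entropy(counts, models[m]))

def sticky_model(stay):
    """Hypothetical order-1 model: repeat the previous base with probability `stay`."""
    other = (1.0 - stay) / 3
    return {a: {b: (stay if a == b else other) for b in ALPHABET} for a in ALPHABET}

# Two illustrative "species": one with long runs of the same base, one uniform.
models = [sticky_model(0.7), sticky_model(0.25)]
print(assign_bin("AAAAAAAACCCCCCCCGGGGGGGGTTTTTTTT", models))  # run-heavy contig
print(assign_bin("ACGTACGTACGTACGTACGT", models))              # no repeated bases
```

For a fixed contig, choosing the model with the smallest value of this quantity selects the same model as maximizing the contig's likelihood, since the two differ only by a term that does not depend on the model. A Euclidean distance between frequency vectors does not have this property in general.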
