The Metagenomic Binning Problem: Clustering Markov Sequences

Abstract

The goal of metagenomics is to study the composition of microbial communities, typically using high-throughput shotgun sequencing. In the metagenomic binning problem, we observe random substrings (called contigs) from a mixture of genomes and want to cluster them according to their genome of origin. Based on the empirical observation that genomes of different bacterial species can be distinguished based on their tetranucleotide frequencies, we model this task as the problem of clustering N sequences generated by M distinct Markov processes, where M≪N. Utilizing the large-deviation principle for Markov processes, we establish the information-theoretic limit for perfect binning. Specifically, we show that the length of the contigs must scale with the inverse of the Chernoff Information between the two most similar species. Our result also implies that contigs should be binned using the conditional relative entropy as a measure of distance, as opposed to the Euclidean distance often used in practice.
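
The distance suggested in the abstract can be illustrated with a minimal sketch. The abstract does not give the estimator, so everything below is an assumption made for illustration: an order-3 Markov chain over A, C, G, T (each transition then corresponds to one tetranucleotide), add-one smoothing for the reference models, and the hypothetical helper names `transition_counts`, `transition_probs`, `conditional_relative_entropy` and `bin_contig`. Contigs are assumed to contain only the four bases.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: contigs contain only these four symbols


def transition_counts(seq, k=3):
    """Count (context, next_base) pairs; with k=3 each pair is one 4-mer."""
    counts = Counter()
    for i in range(len(seq) - k):
        counts[(seq[i:i + k], seq[i + k])] += 1
    return counts


def transition_probs(counts, k=3, pseudocount=1.0):
    """Smoothed estimate of Q(next_base | context) for every context."""
    probs = {}
    for ctx in map("".join, product(ALPHABET, repeat=k)):
        total = sum(counts[(ctx, a)] for a in ALPHABET)
        total += pseudocount * len(ALPHABET)
        for a in ALPHABET:
            probs[(ctx, a)] = (counts[(ctx, a)] + pseudocount) / total
    return probs


def conditional_relative_entropy(contig, model, k=3):
    """D(P_hat || Q | pi_hat) in nats.

    P_hat is the contig's empirical transition distribution, pi_hat its
    empirical context distribution, and Q the reference model.
    """
    counts = transition_counts(contig, k)
    n = sum(counts.values())
    ctx_totals = Counter()
    for (ctx, _), c in counts.items():
        ctx_totals[ctx] += c
    d = 0.0
    for (ctx, a), c in counts.items():
        p_hat = c / ctx_totals[ctx]
        # c / n equals pi_hat(ctx) * P_hat(a | ctx)
        d += (c / n) * math.log(p_hat / model[(ctx, a)])
    return d


def bin_contig(contig, models, k=3):
    """Assign the contig to the model with the smallest divergence."""
    return min(
        models,
        key=lambda name: conditional_relative_entropy(contig, models[name], k),
    )


# Toy reference "genomes" (synthetic, for illustration only).
genome_a = "ACGT" * 200
genome_b = "AAGGCCTT" * 100
models = {
    "a": transition_probs(transition_counts(genome_a)),
    "b": transition_probs(transition_counts(genome_b)),
}
contig = "ACGT" * 20
label = bin_contig(contig, models)
```

Unlike a Euclidean distance between tetranucleotide frequency vectors, this quantity is asymmetric and weights each context by how often it occurs in the contig. In a real binning setting the species models are not known in advance, so they would have to be estimated jointly with the clustering; the sketch only covers the distance computation.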

Bibliographic details

Source
IEEE Information Theory Workshop, 2019, pp. 1-5 (5 pages)
Authors
Grant Greenberg; Ilan Shomorony;
Original format: PDF
Keywords
bioinformatics; entropy; genetics; genomics; Markov processes; microorganisms; pattern clustering;


