A Novel Binning Algorithm Using Topic Modelling and k-mer Frequency on Groups of Non-Overlapping Short Reads

Abstract

Metagenomics is a field that studies microorganisms sampled directly from the environment rather than grown through traditional culturing methods. In this paper, we focus on the binning problem: grouping reads into clusters, each of which closely represents a taxonomic group. The result of this step serves as a crucial input for later steps of a metagenomic project, such as assembly and annotation. Because metagenomic reads do not have explicit features, it is not easy to divide them into distinct groups. Solutions to the binning problem can be categorized as supervised or unsupervised approaches. Supervised approaches need a reference database, which unfortunately covers only about 1% of the microorganisms in nature. This prevents them from working well on datasets that contain unknown species. In this paper we follow an unsupervised approach. Our proposed method combines the result of another technique, BiMeta, which is based on the biological-signature assumption that reads with the same taxonomic label have the same k-mer distribution, with topic modelling as a way of reducing the dimensionality of the dataset. Our method shows better results (by precision, recall, and F-measure) than BiMeta on most datasets. Although it builds on BiMeta, LDABiMeta outperforms it thanks to the newly proposed ideas. Moreover, on short-read datasets our method is on par with MetaProb, currently the most successful method.
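The k-mer distribution assumption mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `kmer_profile` and `euclidean`, the default `k = 4`, and the choice of Euclidean distance are all assumptions made for illustration. The idea is only that each read is mapped to a normalized vector of k-mer frequencies, and that reads from the same taxonomic group are expected to have similar vectors.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmer_profile(read: str, k: int = 4) -> list[float]:
    """Return the normalized k-mer frequency vector of one read.

    The vector has 4**k entries, ordered lexicographically over ACGT.
    k-mers containing other symbols (for example N) are ignored.
    k = 4 is an illustrative default, not a value taken from the paper.
    """
    read = read.upper()
    # Count every overlapping window of length k in the read.
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        # Read too short, or made only of ambiguous bases.
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def euclidean(u: list[float], v: list[float]) -> float:
    """Euclidean distance between two profiles (an assumed metric)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    # Hypothetical reads, made up for this example.
    read_a = "ACGTACGTACGTACGT"
    read_b = "ACGTACGTTCGTACGA"
    read_c = "GGGGCCCCGGGGCCCC"
    pa, pb, pc = (kmer_profile(r, k=2) for r in (read_a, read_b, read_c))
    print("a vs b:", euclidean(pa, pb))
    print("a vs c:", euclidean(pa, pc))
```

In a pipeline of the kind the abstract describes, such profiles (or k-mer counts treated as "words" in a "document") would be the input to clustering and to a topic model for dimensionality reduction; those steps are omitted here.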

Bibliographic details

Source
International Conference on Green Technology and Sustainable Development | 2020 | pp. 380-386 | 7 pages
Authors
Hoang D. Quach; Hoang T. Lam; Dang H. N. Nguyen; Phuong V. D. Van; Van Hoai Tran
Original format: PDF
Keywords
Urban areas; DNA; Genomics; Sustainable development; Natural language processing; Databases; Computer science

Similar literature

1. A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads [J]. Le V Vinh, Tran V Lang, Le T Binh. Algorithms for Molecular Biology, 2015, No. 1.
2. Binning unassembled short reads based on k-mer abundance covariance using sparse coding [J]. Kyrgyzov Olexiy, Prost Vincent, Gazut Stéphane. GigaScience, 2020, No. 4.
3. Exploiting topic modeling to boost metagenomic reads binning [J]. Ruichang Zhang, Zhanzhan Cheng, Jihong Guan. BMC Bioinformatics, 2015, Suppl. 5.
4. Poisson-Markov Mixture Model and Parallel Algorithm for Binning Massive and Heterogenous DNA Sequencing Reads [C]. Lu Wang, Dongxiao Zhu, Yan Li. International Symposium on Bioinformatics Research and Applications, 2016.
5. Fast Algorithms for Dynamic Text Indexing and Short Read Alignment [D]. Sanjeev, Komal. 2016.
6. A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads [O]. Le Van Vinh, Tran Van Lang, Le Thanh Binh. 2015.
7. A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads [O]. 2015.
