Conference: International Conference on Green Technology and Sustainable Development

A Novel Binning Algorithm Using Topic Modelling and k-mer Frequency on Groups of Non-Overlapping Short Reads

Abstract

Metagenomics studies microorganisms directly from their environment rather than through traditional culturing methods. In this paper, we focus on the binning problem, which is to group reads into clusters that each closely represent a taxonomic group. The result of this step serves as a crucial input for later steps of a metagenomic project, such as assembly and annotation. Because metagenomic reads do not have explicit features, it is not easy to divide them into distinct groups. Solutions to the binning problem can be categorized as supervised or unsupervised. Supervised approaches need a reference database, which unfortunately covers only about 1% of the microorganisms in nature. This prevents them from working well on datasets that contain unknown species. In this paper we follow an unsupervised approach. Our proposed method, LDABiMeta, combines the result of another technique named BiMeta, which is based on the biological-signature assumption that reads with the same taxonomic label have the same k-mer distribution, with topic modelling as a way of reducing the dimensionality of the dataset. Our method shows better results (in precision, recall, and F-measure) than BiMeta on most datasets. Although it builds on BiMeta, LDABiMeta outperforms it thanks to the newly proposed ideas. Moreover, our method performs comparably to MetaProb, which is the most successful method at present, on the short-read datasets.
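As a rough illustration of the k-mer signature idea mentioned in the abstract, the sketch below turns a read into a normalized k-mer frequency vector and compares two reads by L1 distance. It is not the BiMeta or LDABiMeta implementation, whose details are not given here; the function names `kmer_frequencies` and `l1_distance`, the choice of k = 4, and the use of L1 distance are all assumptions made for illustration. The same per-read k-mer counts could also serve as the "words" of a "document" if reads were fed to a topic model, which is one plausible reading of the dimensionality-reduction step.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(read: str, k: int = 4) -> dict[str, float]:
    """Return the normalized k-mer frequency vector of one read.

    Hypothetical helper: k-mers containing characters outside ACGT
    (for example N) are skipped, and the vector covers all 4**k k-mers.
    """
    read = read.upper()
    counts = Counter(
        read[i:i + k]
        for i in range(len(read) - k + 1)
        if set(read[i:i + k]) <= set(ALPHABET)
    )
    total = sum(counts.values())
    vocabulary = ("".join(p) for p in product(ALPHABET, repeat=k))
    # Counter returns 0 for k-mers that never occurred in the read.
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in vocabulary}


def l1_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Sum of absolute differences between two frequency vectors (assumed metric)."""
    return sum(abs(p[kmer] - q[kmer]) for kmer in p)


if __name__ == "__main__":
    a = kmer_frequencies("ACGTACGTACGTTTGA", k=2)
    b = kmer_frequencies("ACGTACGAACGTTTGA", k=2)
    print(f"L1 distance between the two reads: {l1_distance(a, b):.3f}")
```

Under the assumption stated in the abstract, reads from the same taxonomic group would tend to give a small distance, while reads from different groups would tend to give a larger one; short reads make single-read vectors noisy, which is one reason to work on groups of reads.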
