首页> 外文期刊>BMC Bioinformatics >Efficient counting of k -mers in DNA sequences using a bloom filter
【24h】

Efficient counting of k -mers in DNA sequences using a bloom filter

机译:使用布隆过滤器对DNA序列中的k-mers进行高效计数

获取原文
           

摘要

Background Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction-often more than 50%-of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction. Results We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors. Conclusions A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html webcite
机译:背景计数k聚体(DNA序列数据中长度为k的子串)是生物信息学中许多方法的重要组成部分,包括用于基因组和转录组装配,用于宏基因组测序以及用于序列读取的错误校正。尽管原理上很简单,但对大型现代序列数据集中的k-mer进行计数很容易使标准计算机的存储容量不堪重负。在当前数据集中,可能会花费很大一部分存储容量(通常超过50%)来存储包含测序错误且通常仅在数据中一次观察到的k-mer。如果没有某种错误校正,这些单例k-mers对许多算法都无用。结果我们提出了一种新方法,该方法可识别DNA序列数据集中出现多次的所有k-mer。我们的方法使用Bloom过滤器来完成此任务,Bloom过滤器是一种概率数据结构,可将所有观察到的k-mers隐式存储在内存中,大大减少了内存需求。然后,我们对数据进行第二次扫描,以提供所有非唯一k-mers的准确计数。例如,与现有软件相比,我们报告的数据集最多可节省50%的内存使用量,而计算速度却不高。这种方法可以减少任何算法的内存需求,这些算法的开始需要对序列数据中有错误的k-mer进行计数。结论此方法的参考实现BFCounter用C ++编写,并获得GPL许可。可从http://pritch.bsd.uchicago.edu/bfcounter.html网站免费下载。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号