Efficient counting of k -mers in DNA sequences using a bloom filter

Páll Melsted; Jonathan K Pritchard

首页> 外文期刊>BMC Bioinformatics >Efficient counting of k -mers in DNA sequences using a bloom filter

【24h】

Efficient counting of k -mers in DNA sequences using a bloom filter

机译：使用布隆过滤器对DNA序列中的k-mers进行高效计数

获取原文

掌桥外文数据库（机构版） >>

开具论文收录证明 >>

文献代查 >>

页面导航

摘要
著录项
相似文献
相关主题

摘要

Background Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction-often more than 50%-of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction. Results We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors. Conclusions A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html webcite

机译：背景计数k聚体（DNA序列数据中长度为k的子串）是生物信息学中许多方法的重要组成部分，包括用于基因组和转录组装配，用于宏基因组测序以及用于序列读取的错误校正。尽管原理上很简单，但对大型现代序列数据集中的k-mer进行计数很容易使标准计算机的存储容量不堪重负。在当前数据集中，可能会花费很大一部分存储容量（通常超过50％）来存储包含测序错误且通常仅在数据中一次观察到的k-mer。如果没有某种错误校正，这些单例k-mers对许多算法都无用。结果我们提出了一种新方法，该方法可识别DNA序列数据集中出现多次的所有k-mer。我们的方法使用Bloom过滤器来完成此任务，Bloom过滤器是一种概率数据结构，可将所有观察到的k-mers隐式存储在内存中，大大减少了内存需求。然后，我们对数据进行第二次扫描，以提供所有非唯一k-mers的准确计数。例如，与现有软件相比，我们报告的数据集最多可节省50％的内存使用量，而计算速度却不高。这种方法可以减少任何算法的内存需求，这些算法的开始需要对序列数据中有错误的k-mer进行计数。结论此方法的参考实现BFCounter用C ++编写，并获得GPL许可。可从http://pritch.bsd.uchicago.edu/bfcounter.html网站免费下载。

著录项

来源
《BMC Bioinformatics》 |2011年第1期|共页
作者
Páll Melsted; Jonathan K Pritchard;
展开▼
作者单位

展开▼
收录信息
原文格式 PDF
正文语种
中图分类生物科学;
关键词

相似文献

外文文献
中文文献
专利

1. Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters [J] . Pellow David, Filippova Darya, Kingsford Carl Journal of computational biology: A journal of computational molecular cell biology . 2017,第6期

机译：使用K-MER Bloom滤波器提高序列数据的绽放过滤性能
2. Classification of DNA sequences using Bloom filters [J] . Stranneheim, Henrik, Kaller, Max, Allander, Tobias, Bioinformatics . 2010,第13期

机译：使用Bloom过滤器对DNA序列进行分类
3. Classification of DNA sequences using Bloom filters [J] . Joakim Lundeberg Bioinformatics . 2010,第13期

机译：使用Bloom过滤器对DNA序列进行分类
4. Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters [C] . David Pellow, Darya Filippova, Carl Kingsford Annual international conference on research in computational molecular biology . 2016

机译：使用k-mer布隆过滤器提高序列数据上的布隆过滤器性能
5. A Selection of DNAs That Efficiently Form Small Circles and Their Comparison with Nucleosome Sequences [D] . Rosanio, Gabriel John 2015

机译：有效形成小圆圈的DNA的选择及其与核小体序列的比较
6. Efficient counting of k-mers in DNA sequences using a bloom filter [O] . Páll Melsted, Jonathan K Pritchard 2011

机译：使用布隆过滤器有效计数DNA序列中的k-mers
7. Efficient counting of k-mers in DNA sequences using a bloom filter [O] . Páll Melsted, Jonathan K Pritchard 2011

机译：使用布隆过滤器有效计数DNA序列中的k-mers

Efficient counting of k -mers in DNA sequences using a bloom filter

摘要

著录项

相似文献

相关主题

期刊订阅