
These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure

Abstract

K-mer abundance analysis is widely used in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast, memory-efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory, which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers themselves, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle, and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
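The trade-off described in the abstract (counts that can be updated and queried online, that may overcount but never undercount, and that are stored without the k-mers themselves) can be illustrated with a minimal Count-Min Sketch in pure Python. This is only a sketch under stated assumptions, not khmer's implementation or API: the names `CountMinSketch` and `kmers`, the `width` and `depth` parameters, and the use of salted BLAKE2b as the hash family are all chosen here for illustration.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min Sketch (hypothetical class, not khmer's API)."""

    def __init__(self, width, depth):
        self.width = width  # counters per row
        self.depth = depth  # number of rows, each with its own hash function
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # Assumption: one salted BLAKE2b hash per row stands in for a
        # family of independent hash functions.
        data = kmer.encode("ascii")
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, kmer):
        # Online update: increment one counter in every row.
        for row, col in self._slots(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        # Collisions can only inflate a counter, so the minimum across
        # rows is the estimate closest to the true count and is never
        # below it.
        return min(self.tables[row][col] for row, col in self._slots(kmer))


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


if __name__ == "__main__":
    sketch = CountMinSketch(width=1024, depth=4)
    for kmer in kmers("ACGTACGTAC", 4):
        sketch.add(kmer)
    print(sketch.count("ACGT"))
```

Only the counter tables are kept, so memory use depends on `width * depth` rather than on the number of distinct k-mers. That is also why the structure cannot list which k-mers it has seen, and why an absent k-mer may still report a nonzero count when all of its slots collide with other k-mers.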

Bibliographic record

Journal: other
Authors: Qingpeng Zhang; Jason Pell; Rosangela Canino-Koning; Adina Chuang Howe; C. Titus Brown
Volume (issue): 9(7)
Article number: e101271
Total pages: 13
Original format: PDF

Similar literature

1. MQF and buffered MQF: quotient filters for efficient storage of k-mers with their counts and metadata [J]. Moustafa Shokrof, C. Titus Brown, Tamer A. Mansour. BMC Bioinformatics, 2021, Issue 1.
2. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers [J]. Carl Kingsford. Bioinformatics, 2011, Issue 6.
3. Efficient Design of Compact Unstructured RNA Libraries Covering All k-mers [J]. Orenstein Yaron, Berger Bonnie. Journal of computational biology: A journal of computational molecular cell biology, 2016, Issue 2.
4. Efficient Counting of k-mers and Spaced Seeds to Speed-Up Alignment-Free Methods [C]. Cinzia Pizzi. Conference on Computability in Europe, 2020.
5. Using k-mer Abundance to Identify Crassphage in Fecal Metagenomes [D]. Parlan, Sabrina A. 2018.
6. MQF and buffered MQF: quotient filters for efficient storage of k-mers with their counts and metadata [O]. Moustafa Shokrof, C. Titus Brown, Tamer A. Mansour. 2021.
7. These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure [O]. Qingpeng Zhang, Jason Pell, Rosangela Canino-Koning, 2014.
