
These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure


Abstract

K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory-efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory, which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory-efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
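
To make the trade-off described in the abstract concrete, the following is a minimal Python sketch of a Count-Min Sketch used for k-mer counting. It is an illustration only and is not khmer's API or implementation: the names `CountMinSketch`, `kmers`, and `count_kmers`, the `width` and `depth` defaults, and the use of salted BLAKE2b as the hash family are all assumptions chosen for readability. It shows the two properties the abstract mentions: only counters are stored (never the k-mers themselves), and hash collisions can inflate a reported count but never reduce it.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min Sketch: `depth` rows of `width` counters each."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        # Only counters are kept; the k-mers themselves are never stored.
        self.tables = [[0] * width for _ in range(depth)]

    def _indexes(self, kmer):
        # One hash per row, derived by salting BLAKE2b with the row number
        # (an assumption for this sketch, not the hashing scheme khmer uses).
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode("ascii"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, kmer):
        # Online update: increment one counter in every row.
        for row, col in self._indexes(kmer):
            self.tables[row][col] += 1

    def get(self, kmer):
        # Collisions only ever add to a counter, so the minimum across rows
        # is an upper bound on the true count (overcount, never undercount).
        return min(self.tables[row][col] for row, col in self._indexes(kmer))


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k, width=1024, depth=4):
    """Stream reads into a sketch; counts can be queried at any point."""
    sketch = CountMinSketch(width, depth)
    for read in reads:
        for kmer in kmers(read, k):
            sketch.add(kmer)
    return sketch
```

For example, `count_kmers(["ACGTACGT"], k=4).get("ACGT")` returns a value of at least 2, because `ACGT` occurs twice in the read. Memory use is fixed at `width * depth` counters regardless of how many distinct k-mers are seen, which is why the structure is attractive on sparse data sets; making the tables smaller saves memory at the cost of more collisions and therefore larger overcounts.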
