...
Journal: Current Genomics

Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art


Abstract

Background: In bioinformatics, estimating k-mer abundance histograms, or just enumerating the number of unique k-mers and the number of singletons, is desirable in many genome sequence analysis applications. These applications include predicting genome sizes, data pre-processing for de Bruijn graph assembly methods (tuning runtime parameters for analysis tools), repeat detection, sequencing coverage estimation, and measuring sequencing error rates. Different methods for cardinality estimation in sequencing data have been developed in recent years.

Objective: In this article, we present a comparative assessment of the different k-mer frequency estimation programs (ntCard, KmerGenie, KmerStream and Khmer (abundance-dist-single.py and unique-kmers.py)) to assess their relative merits and demerits.

Methods: Principally, the miscounts/error rates of these tools are analyzed through rigorous experiments over a varied range of k. We also present experimental results on runtime, scalability for larger datasets, memory, CPU utilization, and parallelism of the k-mer frequency estimation methods.

Results: The results indicate that ntCard is more accurate than the other methods in estimating F0, f1 and full k-mer abundance histograms. ntCard is the fastest, but it requires more memory than KmerGenie.

Conclusion: The results of this evaluation may serve as a roadmap for potential users and practitioners of streaming algorithms for estimating k-mer coverage frequencies, assisting them in identifying an appropriate method. Such analysis also helps researchers to discover remaining open research questions, effective combinations of existing techniques, and possible avenues for future research.
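
To make the quantities in the abstract concrete, the following is a minimal sketch of an exact k-mer abundance histogram in Python. It is not the method of any of the assessed tools, which estimate these values with streaming or sampling techniques; it is the kind of exact baseline such estimates could be checked against on small inputs. It assumes that F0 denotes the number of distinct k-mers and f1 the number of k-mers seen exactly once (singletons). The function names `kmer_histogram` and `summarize` are hypothetical, and the sketch does not merge a k-mer with its reverse complement, which real tools often do.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Return an exact abundance histogram: multiplicity i -> number of k-mers seen i times."""
    counts = Counter()
    valid = set("ACGT")
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= valid:
                counts[kmer] += 1
    # Collapse per-k-mer counts into a histogram of multiplicities.
    return Counter(counts.values())


def summarize(hist):
    """Return (F0, f1), assuming F0 = distinct k-mers and f1 = singletons."""
    f0 = sum(hist.values())
    f1 = hist.get(1, 0)
    return f0, f1


if __name__ == "__main__":
    # "AAAAC" with k=2 yields AA three times and AC once.
    hist = kmer_histogram(["AAAAC"], k=2)
    print(dict(hist))       # multiplicity 3 -> 1 k-mer, multiplicity 1 -> 1 k-mer
    print(summarize(hist))  # (2, 1)
```

Exact counting like this keeps every distinct k-mer in memory, which is what makes it impractical for large sequencing datasets and motivates the estimators compared in the article.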