Source: U.S. National Institutes of Health literature, PLoS Clinical Trials

The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome



Abstract

One of the most common ways to search a sequence database for sequences that are similar to a query sequence is to use a k-mer index such as BLAST. A major drawback of k-mer indexes is the space required to store the lists of all occurrences of all k-mers in the database. One method for reducing both the space needed and the query time is sampling, in which only some k-mer occurrences are stored. Most previous work uses hard sampling, in which enough k-mer occurrences are retained so that all similar sequences are guaranteed to be found. In contrast, we study soft sampling, which further reduces the number of stored k-mer occurrences at a cost of decreasing query accuracy. We focus on finding highly similar local alignments (HSLA) over nucleotide sequences, an operation that is fundamental to biological applications such as cDNA sequence mapping. For our comparison, we use the NCBI BLAST tool with the human genome and human ESTs. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. For the human genome and HSLAs of length at least 100 bp, soft sampling reduces index size 4-10 times more than hard sampling and processes queries 2.3-6.8 times faster, while still achieving retention rates of at least 96.6%. When we apply soft sampling to the problem of mapping ESTs against the genome, we map more than 98% of ESTs perfectly while reducing the index size by a factor of 4 and query time by 23.3%. These results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy by modeling two key problem factors.
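The difference between hard and soft sampling can be illustrated with a minimal Python sketch. This is not the BLAST implementation and not the authors' code. It assumes the simplest sampling scheme (keep only k-mers that start at every `step`-th database position) and considers only exact matches, whereas real HSLAs may contain mismatches. All function names, the toy database, and the parameter values are hypothetical. Under these assumptions, an exact match of length `match_len` contains `match_len - k + 1` consecutive k-mer start positions, so any step up to that value is guaranteed to keep at least one of them (hard sampling), while a larger step stores fewer occurrences but may miss the match (soft sampling).

```python
import random
from collections import defaultdict


def build_sampled_index(db, k, step):
    """Map each k-mer to its start positions in db, keeping only
    start positions that are multiples of step (step=1 keeps all)."""
    index = defaultdict(list)
    for pos in range(0, len(db) - k + 1, step):
        index[db[pos:pos + k]].append(pos)
    return index


def find_seeds(index, query, k):
    """Return (query_pos, db_pos) pairs for every query k-mer
    that has a stored occurrence in the index."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for dpos in index.get(query[qpos:qpos + k], ()):
            seeds.append((qpos, dpos))
    return seeds


def max_hard_step(match_len, k):
    """Largest step that still guarantees a sampled k-mer inside
    every exact match of length match_len (assumes exact matches)."""
    return match_len - k + 1


# Toy data (hypothetical): a random 200 bp database and a query
# copied verbatim from database positions 50..79.
rng = random.Random(0)
db = "".join(rng.choice("ACGT") for _ in range(200))
k, match_start, match_len = 8, 50, 30
query = db[match_start:match_start + match_len]

hard_step = max_hard_step(match_len, k)   # 30 - 8 + 1 = 23
soft_step = 40                            # larger than the hard limit

hard_index = build_sampled_index(db, k, hard_step)
soft_index = build_sampled_index(db, k, soft_step)

# A seed lies on the true alignment when db_pos - query_pos == match_start.
hard_hit = any(d - q == match_start for q, d in find_seeds(hard_index, query, k))
soft_hit = any(d - q == match_start for q, d in find_seeds(soft_index, query, k))

hard_size = sum(len(v) for v in hard_index.values())
soft_size = sum(len(v) for v in soft_index.values())
print(hard_step, hard_size, hard_hit)
print(soft_step, soft_size, soft_hit)
```

In this toy case the soft index stores fewer occurrences than the hard index but loses the seed for this particular match. The abstract reports that on the human genome such losses are small relative to the savings in index size and query time.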
