The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome

Abstract

One of the most common ways to search a sequence database for sequences that are similar to a query sequence is to use a k-mer index, such as the one used by BLAST. A major problem with k-mer indexes is the space required to store the lists of all occurrences of all k-mers in the database. One method for reducing both the space needed and the query time is sampling, in which only some k-mer occurrences are stored. Most previous work uses hard sampling, in which enough k-mer occurrences are retained so that all similar sequences are guaranteed to be found. In contrast, we study soft sampling, which further reduces the number of stored k-mer occurrences at the cost of decreased query accuracy. We focus on finding highly similar local alignments (HSLAs) over nucleotide sequences, an operation that is fundamental to biological applications such as cDNA sequence mapping. For our comparison, we use the NCBI BLAST tool with the human genome and human ESTs. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. For the human genome and HSLAs of length at least 100 bp, soft sampling reduces index size 4-10 times more than hard sampling and processes queries 2.3-6.8 times faster, while still achieving retention rates of at least 96.6%. When we apply soft sampling to the problem of mapping ESTs against the genome, we map more than 98% of ESTs perfectly while reducing the index size by a factor of 4 and query time by 23.3%. These results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy by modeling two key problem factors.
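
The trade-off between hard and soft sampling can be illustrated with a toy index. The sketch below is not the authors' implementation and does not use BLAST. It assumes one simple scheme, fixed-step sampling, in which only k-mers starting at every `step`-th database position are stored while every k-mer of the query is looked up. It also assumes exact matches only, whereas the abstract concerns alignments that may contain mismatches. All names (`build_sampled_index`, `find_seed_hits`, `max_hard_step`, `DATABASE`) are hypothetical.

```python
from collections import defaultdict


def build_sampled_index(text, k, step):
    """Store only the k-mers that start at positions 0, step, 2*step, ..."""
    index = defaultdict(list)
    for pos in range(0, len(text) - k + 1, step):
        index[text[pos:pos + k]].append(pos)
    return index


def find_seed_hits(index, query, k):
    """Look up every k-mer of the query; return (query_pos, db_pos) pairs."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for dpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, dpos))
    return hits


def max_hard_step(min_match_len, k):
    """Largest step for which any exact match of at least min_match_len
    bases must contain one stored k-mer (assumption: exact matches only)."""
    return min_match_len - k + 1


def index_size(index):
    """Number of stored k-mer occurrences."""
    return sum(len(positions) for positions in index.values())


# Toy data (made up for illustration, not from the paper).
DATABASE = "ACGTTGCATGCCGATAGGCTTAACGGATCCA"
QUERY = DATABASE[5:20]  # a 15-base exact match starting at database position 5
K = 4

hard_step = max_hard_step(len(QUERY), K)  # 15 - 4 + 1 = 12
for step in (1, hard_step, 20):
    idx = build_sampled_index(DATABASE, K, step)
    print(step, index_size(idx), find_seed_hits(idx, QUERY, K))
```

With a step of 12 the index shrinks from 28 stored occurrences to 3 and the match is still seeded, which mirrors hard sampling. With a step of 20 the index shrinks further, but this particular match is missed, which is the kind of accuracy loss soft sampling accepts.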

Bibliographic details

Journal: PLoS Clinical Trials
Authors: Meznah Almutairy; Eric Torng
Year (volume), issue: 2011 (12), 7
Year: 2011
Pages: e0179046
Total pages: 23
Original format: PDF

Similar literature

1. A Review on the Theoretical and Empirical Efficiency Comparisons of Some Ratio and Product Type Mean Estimators in Two Phase Sampling Scheme [J]. Özge Akkuş. American Journal of Mathematics and Statistics. 2016, No. 1
2. Application of Malmquist Indexes, Empirical Model and Data Envelopment Analysis: A Measure of Performance and Efficiency of Commercial Banks in Taiwan [J]. British Journal of Economics, Management & Trade. 2013, No. 3
3. Interspecies hybridization on DNA resequencing microarrays: efficiency of sequence recovery and accuracy of SNP detection in human, ape, and codfish mitochondrial DNA genomes sequenced on a human-specific MitoChip [J]. Sarah MC Flynn, Steven M Carr. BMC Genomics. 2007, No. 1
4. Silent Mutation Effects on Translation Efficiency in Human Cancer Genome [C]. Gu Wanjun, Liang Wei. 5th International Conference on Bioinformatics and Biomedical Engineering. 2011
5. Studying the Effects of Sampling on the Efficiency and Accuracy of k-mer Indexes [D]. Almutairy, Meznah. 2017
6. Sampling Methodologies for Epidemiologic Surveillance of Men Who Have Sex with Men and Transgender Women in Latin America: An Empiric Comparison of Convenience Sampling, Time Space Sampling, and Respondent Driven Sampling [O]. J. L. Clark, K. A. Konda, A. Silva-Santisteban
7. The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome [O]. Meznah Almutairy, Eric Torng. 2017
