Asia-Pacific Bioinformatics Conference

ALL HITS ALL THE TIME: PARAMETER FREE CALCULATION OF SEED SENSITIVITY


Abstract

Standard search techniques for DNA repeats start by identifying seeds, that is, small matching words, that may inhabit larger repeats. Recent innovations in seed structure have led to the development of spaced seeds [8] and indel seeds [9], which are more sensitive than contiguous seeds (also known as k-mers, k-tuples, l-words, etc.). Evaluating seed sensitivity requires 1) specifying a homology model that describes the types of alignments that can occur between two copies of a repeat, and 2) assigning probabilities to those alignments. Optimal seed selection is a resource-intensive activity because essentially all alternative seeds must be tested [7]. Current methods require that the model and probability parameters be specified in advance. When the parameters change, the entire calculation has to be rerun. In this paper, we show how to eliminate the need for prior parameter specification. The ideas presented follow from a simple observation: given a homology model, the alignments hit by a particular seed remain the same regardless of the probability parameters. Only the weights assigned to those alignments change. Therefore, if we know all the hits, we can easily (and quickly) find optimal seeds. We describe a highly efficient preprocessing step, which is computed just once for each seed. In this calculation, strings that represent possible alignments are unweighted by any probability parameters. Then we show several increasingly efficient methods to find the optimal seed when given specific probability parameters. Indeed, we show how to determine exactly which seeds can never be optimal under any set of probability parameters. This leads to the startling observation that out of thousands of seeds, only a handful have any chance of being optimal. We then show how to find optimal seeds and the boundaries within probability space where they are optimal. We expect this method to greatly facilitate the study of seed space sensitivity, construction of multiple seed sets, and the use of alternative definitions of optimality.
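The central observation, that a seed's hits are fixed and only their weights depend on the parameters, can be illustrated with a small brute-force sketch. It is not the efficient preprocessing step the abstract refers to. It assumes a much simpler homology model than the paper may use: ungapped alignments of fixed length written as 0/1 strings (1 = match), with a single match probability `p` applied independently at each position. The function names `hit_counts`, `sensitivity`, and `best_seed` are hypothetical and chosen only for this example.

```python
from itertools import product


def hit_counts(seed: str, length: int) -> list[int]:
    """Parameter-free table: counts[m] is the number of 0/1 alignment
    strings of the given length with exactly m matches that the seed hits.
    In the seed, '1' is a required match and any other character is a
    don't-care position (assumed spaced-seed notation)."""
    offsets = [i for i, c in enumerate(seed) if c == "1"]
    starts = range(length - len(seed) + 1)
    counts = [0] * (length + 1)
    # Exhaustive enumeration: only feasible for short alignments.
    for bits in product((0, 1), repeat=length):
        if any(all(bits[s + o] for o in offsets) for s in starts):
            counts[sum(bits)] += 1
    return counts


def sensitivity(counts: list[int], p: float) -> float:
    """Weight the fixed hit table with a match probability p
    (assumed independent Bernoulli model)."""
    n = len(counts) - 1
    return sum(c * p**m * (1 - p) ** (n - m) for m, c in enumerate(counts))


def best_seed(tables: dict[str, list[int]], p: float) -> str:
    """Pick the seed with the highest sensitivity at p, reusing the tables."""
    return max(tables, key=lambda seed: sensitivity(tables[seed], p))


if __name__ == "__main__":
    # All weight-3 seeds of span at most 5, as an arbitrary small candidate set.
    seeds = ["111", "1101", "1011", "11001", "10101", "10011"]
    tables = {s: hit_counts(s, 12) for s in seeds}  # computed once per seed
    for p in (0.5, 0.7, 0.9):  # changing p needs no re-enumeration
        print(p, best_seed(tables, p))
```

In this sketch the expensive enumeration runs once per seed, and each new value of `p` costs only a polynomial evaluation over the stored table. Comparing those polynomials across seeds is one way to see, under the stated assumptions, which seeds are ever optimal and over which ranges of `p`.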
