Journal of Computational Biology: A Journal of Computational Molecular Cell Biology

Separating Significant Matches from Spurious Matches in DNA Sequences

Abstract

Word matches are widely used to compare genomic sequences. Complete genome alignment methods often use matches as anchors for building their alignments, and various alignment-free approaches that characterize similarities between large sequences are based on word matches. Among the matches retrieved from the comparison of two genomic sequences, some may be spurious matches (SMs), that is, matches obtained by chance rather than through homologous relationships. The number of SMs depends on the minimal match length (ℓ) that has to be set in the algorithm used to retrieve them. If ℓ is too small, many matches are recovered but most of them are SMs. Conversely, if ℓ is too large, fewer matches are retrieved and many shorter significant matches are ignored. To date, the choice of ℓ has mostly relied on empirical threshold values rather than robust statistical methods. To overcome this problem, we propose a statistical approach that uses a mixture model of geometric distributions to characterize the distribution of the lengths of matches obtained from the comparison of two genomic sequences.
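To make the idea of a geometric mixture over match lengths concrete, here is a minimal sketch. It is not the authors' procedure: the abstract does not say how many components the mixture has or how it is fitted, so the sketch assumes two components (a fast-decaying one standing in for spurious matches and a slow-decaying one for significant matches), lengths shifted by the minimal match length, and a plain EM fit. All function names and starting values are hypothetical.

```python
import math
import random


def geometric_pmf(k, p):
    """P(X = k) for a geometric distribution on k = 0, 1, 2, ..."""
    return p * (1.0 - p) ** k


def simulate_match_lengths(n, min_len, w, p_short, p_long, rng):
    """Draw n synthetic match lengths from a two-component geometric mixture.

    Illustrative data only; real lengths would come from a match finder.
    """
    lengths = []
    for _ in range(n):
        p = p_short if rng.random() < w else p_long
        # Inverse-CDF sampling of a geometric variable on 0, 1, 2, ...
        k = int(math.log(1.0 - rng.random()) / math.log(1.0 - p))
        lengths.append(min_len + k)
    return lengths


def fit_two_geometric_mixture(lengths, min_len, n_iter=200):
    """EM fit of w * Geom(p_short) + (1 - w) * Geom(p_long).

    Assumption: exactly two components, fitted to lengths minus min_len.
    Returns (w, p_short, p_long), where w is the weight of the
    fast-decaying ("spurious") component.
    """
    xs = [length - min_len for length in lengths]
    n = len(xs)
    w, p_short, p_long = 0.5, 0.5, 0.05  # arbitrary starting values
    for _ in range(n_iter):
        # E-step: responsibility of the fast-decaying component for each x
        resp = []
        for x in xs:
            a = w * geometric_pmf(x, p_short)
            b = (1.0 - w) * geometric_pmf(x, p_long)
            resp.append(a / (a + b) if a + b > 0.0 else 0.5)
        # M-step: weighted maximum-likelihood updates
        r_sum = sum(resp)
        w = r_sum / n
        p_short = r_sum / sum(r * (x + 1) for r, x in zip(resp, xs))
        p_long = (n - r_sum) / sum((1.0 - r) * (x + 1) for r, x in zip(resp, xs))
    return w, p_short, p_long


def posterior_spurious(length, min_len, w, p_short, p_long):
    """Posterior probability that a match of this length is from the
    fast-decaying component under the fitted mixture."""
    x = length - min_len
    a = w * geometric_pmf(x, p_short)
    b = (1.0 - w) * geometric_pmf(x, p_long)
    return a / (a + b)
```

Under these assumptions, one way to choose a length cutoff would be to take the smallest length at which `posterior_spurious` falls below a chosen tolerance. That rule is an illustration only and is not stated in the abstract.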
