首页> 外文会议>Theory and application of models of computation >Discovering Almost Any Hidden Motif from Multiple Sequences in Polynomial Time with Low Sample Complexity and High Success Probability
【24h】

Discovering Almost Any Hidden Motif from Multiple Sequences in Polynomial Time with Low Sample Complexity and High Success Probability

机译:在多项式时间内从多个序列中发现几乎所有隐藏的基元,且样本复杂度低且成功几率高

获取原文
获取原文并翻译 | 示例

摘要

We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑. A motif G = g1g2... gm is a string of m characters. Each background sequence is implanted a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b_1b_2...b_m of G, every character is probabilistically generated such that the probability for b_i ≠ g_i is at most α. It has been conjectured that multiple background sequences can help with finding faint motifs G.rnIn this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for α < 1/4(1 -1/|∑|) and any constant x ≥ 8, there exist positive constants c_0,∈,δ_1 and δ_2 such that if the length ρ of motif G is at least δ_1 log n, and there are k ≥ c_0 log n input sequences, then in O(n~2 + kn) time this algorithm finds the motif with probability at least 1 -1/2~x for every G ∈ ∑~ρ -ψ_(ρ,h,∈)(∑), where ρ is the length of the motif, h is a parameter with ρ ≥ 4h ≥ δ_2 log n, and ψ_(ρ,h,∈)(∑) is a small subset of at most 2~(-Θ(∈~2h)) fraction of the sequences in ∑~ρ. The constants c_0,∈,δ_1 and δ_2 do not depend on x when x is a parameter of order O(log n). Our algorithm can take any number k sequences as input.
机译:我们研究了用于主题发现的自然概率模型,该模型已用于通过实验测试主题发现程序的有效性。在该模型中,有k个背景序列,并且背景序列中的每个字符都是来自字母∑的随机字符。图案G = g1g2 ... gm是一串m个字符。每个背景序列都植入了一个概率生成的G近似副本。对于一个概率生成的G近似副本b_1b_2 ... b_m,每个字符都被概率生成,使得b_i≠g_i的概率最大为α。据推测,多个背景序列可以帮助找到模糊的基序G.rn。在本文中,我们开发了一种有效的算法,该算法可以从序列集中为||||的任何字母∑发现隐藏的基序。 ≥2,适用于DNA基序发现。我们证明对于α<1/4(1 -1 / | ∑ |)且任何常数x≥8,都存在正常数c_0,∈,δ_1和δ_2,使得如果图案G的长度ρ至少为δ_1log n,并且有k≥c_0个log n个输入序列,然后在O(n〜2 + kn)时间内,对于每个G∈∑〜ρ-ψ_( ρ,h,∈)(∑),其中ρ是主题的长度,h是ρ≥4h≥δ_2log n的参数,ψ_(ρ,h,∈)(∑)是at的小子集∑〜ρ中序列的最多2〜(-Θ(∈〜2h))个分数。当x是阶数O(log n)的参数时,常数c_0,ε,δ_1和δ_2不依赖于x。我们的算法可以将任意数量的k序列作为输入。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号