首页> 外文期刊>ACM transactions on algorithms >Discovering almost any hidden motif from multiple sequences
【24h】

Discovering almost any hidden motif from multiple sequences

机译:从多个序列中发现几乎所有隐藏的主题

获取原文
获取原文并翻译 | 示例
获取外文期刊封面目录资料

摘要

We study a natural probabilistic model for motif discovery. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet ∑. A motif G = g_1g _2 ? g_m is a string of m characters. Each background sequence is implanted with a probabilistically generated approximate copy of G. For a probabilistically generated approximate copy b_1b_2 ? bm of G, every character is probabilistically generated such that the probability for b_i ≠ g_i is at most α. In this article, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet ∑ with |∑| ≥ 2 and is applicable to DNA motif discovery. We prove that for α < 1/8 (1 - 1/|∑|), there exist positive constants c_0, ε, and δ_2 such that if there are at least c0 logn input sequences, then in O(n~2/h (log n)~(O(1))) time this algorithm finds the motif with probability at least 3/4 for every G ε ∑~p - ψ_p,h,ε (∑), where n the length of longest sequences, p is the length of the motif, h is a parameter with p ≥ 4h ≥ δ_2 log n, and εp,h,ε (∑) is a small subset of at most 2-Θ(ε~2h) fraction of the sequences in ∑p.
机译:我们研究了自然的概率模型进行主题发现。在该模型中,有k个背景序列,并且背景序列中的每个字符都是来自字母∑的随机字符。图案G = g_1g _2吗? g_m是m个字符的字符串。每个背景序列都植入了概率生成的G近似副本。 b G的概率,每个字符都是概率生成的,因此b_i≠g_i的概率最大为α。在本文中,我们开发了一种有效的算法,该算法可以从序列集中发现任何字母Σ带有|||的隐藏主题。 ≥2,适用于DNA基序发现。我们证明,对于α<1/8(1 /-|| ∑ |),存在正常数c_0,ε和δ_2,使得如果至少有c0 logn个输入序列,则为O(n〜2 / h) (log n)〜(O(1)))时间,此算法针对每个Gε∑〜p-ψ_p,h,ε(∑)找到概率至少为3/4的主题,其中n最长序列的长度, p是基序的长度,h是p≥4h≥δ_2log n的参数,εp,h,ε(∑)是序列中最多2-Θ(ε〜2h)个分数的小子集∑p。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号