
Discovering Almost Any Hidden Motif from Multiple Sequences in Polynomial Time with Low Sample Complexity and High Success Probability



Abstract

We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are $k$ background sequences, and each character in a background sequence is a random character from an alphabet $\Sigma$. A motif $G = g_1 g_2 \ldots g_m$ is a string of $m$ characters. A probabilistically generated approximate copy of $G$ is implanted into each background sequence. In an approximate copy $b_1 b_2 \ldots b_m$ of $G$, every character is generated probabilistically such that the probability that $b_i \neq g_i$ is at most $\alpha$. It has been conjectured that multiple background sequences can help with finding faint motifs $G$.

In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet $\Sigma$ with $|\Sigma| \geq 2$ and is applicable to DNA motif discovery. We prove that for $\alpha < \frac{1}{4}\left(1 - \frac{1}{|\Sigma|}\right)$ and any constant $x \geq 8$, there exist positive constants $c_0$, $\epsilon$, $\delta_1$ and $\delta_2$ such that if the length $\rho$ of motif $G$ is at least $\delta_1 \log n$, and there are $k \geq c_0 \log n$ input sequences, then in $O(n^2 + kn)$ time this algorithm finds the motif with probability at least $1 - \frac{1}{2^x}$ for every $G \in \Sigma^{\rho} - \psi_{\rho,h,\epsilon}(\Sigma)$, where $h$ is a parameter with $\rho \geq 4h \geq \delta_2 \log n$, and $\psi_{\rho,h,\epsilon}(\Sigma)$ is a small subset containing at most a $2^{-\Theta(\epsilon^2 h)}$ fraction of the sequences in $\Sigma^{\rho}$. The constants $c_0$, $\epsilon$, $\delta_1$ and $\delta_2$ do not depend on $x$ when $x$ is a parameter of order $O(\log n)$. Our algorithm can take any number $k$ of sequences as input.
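The probabilistic model in the abstract can be illustrated with a short Python sketch that generates test instances. It is not the paper's algorithm, only an input generator under assumptions the abstract does not state: every sequence has the same length n, the approximate copy overwrites a uniformly random position, and each motif character is swapped for a different character with probability exactly alpha (the abstract only bounds the mismatch probability by alpha). The function names `mutate_motif`, `generate_instance`, and `alpha_threshold` are hypothetical, and the threshold assumes the bound is read as (1/4)(1 - 1/|Σ|).

```python
import random


def mutate_motif(motif, alphabet, alpha, rng):
    """Return an approximate copy of motif.

    Assumption: each character mismatches with probability exactly alpha,
    and a mismatch is drawn uniformly from the other characters.
    """
    copy = []
    for g in motif:
        if rng.random() < alpha:
            copy.append(rng.choice([c for c in alphabet if c != g]))
        else:
            copy.append(g)
    return "".join(copy)


def generate_instance(motif, k, n, alphabet="ACGT", alpha=0.1, seed=0):
    """Generate k random background sequences of length n, each with one
    approximate copy of motif implanted at a random position.

    Returns the sequences and the implant positions (the hidden truth).
    """
    rng = random.Random(seed)
    sequences, positions = [], []
    for _ in range(k):
        # Background: independent uniform characters from the alphabet.
        background = [rng.choice(alphabet) for _ in range(n)]
        # Assumption: the copy overwrites a uniformly chosen window.
        pos = rng.randrange(n - len(motif) + 1)
        background[pos:pos + len(motif)] = mutate_motif(motif, alphabet, alpha, rng)
        sequences.append("".join(background))
        positions.append(pos)
    return sequences, positions


def alpha_threshold(alphabet_size):
    """Upper bound on alpha, assuming it is read as (1/4)(1 - 1/|alphabet|)."""
    return 0.25 * (1 - 1 / alphabet_size)


if __name__ == "__main__":
    seqs, pos = generate_instance("ACGTTGCA", k=3, n=40, alpha=0.1, seed=42)
    for s, p in zip(seqs, pos):
        print(p, s)
    print("alpha bound for DNA:", alpha_threshold(4))
```

Under that reading of the bound, a DNA alphabet (|Σ| = 4) gives α < 3/16 = 0.1875, and a binary alphabet gives α < 1/8.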


Bibliographic details

Source
Theory and Application of Models of Computation | 2009 | pp. 231-240 | 10 pages
Conference location: Changsha (CN)
Authors
Bin Fu; Ming-Yang Kao; Lusheng Wang
Author affiliations

Dept. of Computer Science, University of Texas - Pan American, TX 78539, USA

Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208, USA

Department of Computer Science, The City University of Hong Kong, Kowloon, Hong Kong

Original format: PDF
Language: English
CLC classification: Computing technology, computer technology

Similar literature


1. Discovering almost any hidden motif from multiple sequences [J]. Fu B., Kao M.-Y., Wang L. ACM Transactions on Algorithms. 2011, No. 2
2. ARCS-Motif: discovering correlated motifs from unaligned biological sequences [J]. Zhang S., Su W., Yang J. Bioinformatics. 2009, No. 2
3. ARCS-Motif: discovering correlated motifs from unaligned biological sequences [J]. Shijie Zhang, Wei Su, Jiong Yang. Bioinformatics. 2009, No. 2
4. Discovering Almost Any Hidden Motif from Multiple Sequences in Polynomial Time with Low Sample Complexity and High Success Probability [C]. Bin Fu, Ming-Yang Kao, Lusheng Wang. Annual Conference on Theory and Applications of Models of Computation. 2009
5. Discovering motifs in DNA and protein sequences: The approximate common substring problem [D]. Bailey, Timothy Lawrence. 1995
6. Reconstructing phylogenies from noisy quartets in polynomial time with a high success probability [O]. Gang Wu, Ming-Yang Kao, Guohui Lin. 2008
7. Figure 4: (A) One conserved sequence, which occurs 79 times in 46,264 binding site peaks from the ChIP-seq data-set. The mutation profile of this conserved sequence is illustrated, where '_' indicates this base is unchanged; DEL indicates this base is lost; INS X indicates a new base X is inserted in front of this base. (B) Several repeated elements patterns are listed. (C) In the first column, the top five DNA motifs, mined by meme-chip tools (Machanick & Bailey, 2011) are illustrated. The resemblant conserved sequences, found by the CFSP algorithm are listed in the second column. In the third column, the position-specific scoring matrices, which are transformed from mutational information are listed. The similarity between meme motif and resemblant conserved sequence with PSSM format was calculated via a stamp motif comparison tool (Mahony & Benos, 2007). The E-values for the similarity of those pairs are displayed in the fourth column. (D) One motif is selected in each group clustered by gkmsvm descriptors, and the corresponding motif found by the CFSP algorithm is listed below. (E) There are additional datasets (File No: ENCFF100GRL, ENCFF616IRT, ENCFF870CER, Target: SREBF1) collected from https://www.encodeproject.org. The top two motifs are selected in each file using meme tools, and the corresponding motifs found by our algorithm are listed below. [O].
8. Re-Assessment of Road Accident Data-Analysis Policy: Applying Theory from Involuntary, High-Consequence, Low-Probability Events like Nuclear Power Plant Meltdowns to Voluntary, Low-Consequence, High-Probability Events like Traffic Accidents [R]. Naveh, E., Marcus, A. 2002
