Discovering almost any hidden motif from multiple sequences

Fu B.; Kao M.-Y.; Wang L.

Abstract

We study a natural probabilistic model for motif discovery. In this model, there are $k$ background sequences, and each character in a background sequence is a random character from an alphabet $\Sigma$. A motif $G = g_1 g_2 \cdots g_m$ is a string of $m$ characters. Each background sequence is implanted with a probabilistically generated approximate copy of $G$. For a probabilistically generated approximate copy $b_1 b_2 \cdots b_m$ of $G$, every character is probabilistically generated such that the probability for $b_i \neq g_i$ is at most $\alpha$. In this article, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet $\Sigma$ with $|\Sigma| \geq 2$ and is applicable to DNA motif discovery. We prove that for $\alpha < \frac{1}{8}\left(1 - \frac{1}{|\Sigma|}\right)$, there exist positive constants $c_0$, $\epsilon$, and $\delta_2$ such that if there are at least $c_0 \log n$ input sequences, then in $O\left((n^2/h)(\log n)^{O(1)}\right)$ time this algorithm finds the motif with probability at least $3/4$ for every $G \in \Sigma^p - \Psi_{p,h,\epsilon}(\Sigma)$, where $n$ is the length of the longest sequence, $p$ is the length of the motif, $h$ is a parameter with $p \geq 4h \geq \delta_2 \log n$, and $\Psi_{p,h,\epsilon}(\Sigma)$ is a small subset containing at most a $2^{-\Theta(\epsilon^2 h)}$ fraction of the sequences in $\Sigma^p$.
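
The generative model described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' discovery algorithm; it only samples inputs of the kind the model describes. Several details are assumptions made for illustration: the function names (`alpha_bound`, `approximate_copy`, `plant_motif`) are hypothetical, background characters and the implant position are drawn uniformly, the copy overwrites background characters rather than being inserted, each position mutates with probability exactly `alpha` (the abstract only requires "at most"), and the noise bound is read as (1/8)(1 - 1/|Σ|).

```python
import random


def alpha_bound(alphabet_size: int) -> float:
    """Noise threshold from the abstract, read here as (1/8) * (1 - 1/|Sigma|)."""
    return (1.0 / 8.0) * (1.0 - 1.0 / alphabet_size)


def approximate_copy(motif: str, alphabet: str, alpha: float, rng: random.Random) -> str:
    """Copy the motif, changing each character independently with probability alpha.

    Assumption: a changed character is drawn uniformly from the other letters.
    """
    copy = []
    for g in motif:
        if rng.random() < alpha:
            copy.append(rng.choice([c for c in alphabet if c != g]))
        else:
            copy.append(g)
    return "".join(copy)


def plant_motif(motif: str, alphabet: str, k: int, n: int, alpha: float,
                rng: random.Random) -> list[tuple[str, int]]:
    """Generate k random background sequences of length n, each with one implanted copy.

    Returns (sequence, implant_position) pairs. The position is kept only so the
    sketch can be checked; a discovery algorithm would not be given it.
    """
    m = len(motif)
    result = []
    for _ in range(k):
        background = [rng.choice(alphabet) for _ in range(n)]
        pos = rng.randrange(n - m + 1)  # assumption: uniform implant position
        background[pos:pos + m] = list(approximate_copy(motif, alphabet, alpha, rng))
        result.append(("".join(background), pos))
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    dna = "ACGT"
    # Use a noise level safely below the threshold for a 4-letter alphabet.
    for seq, pos in plant_motif("ACGTACGT", dna, k=5, n=40, alpha=alpha_bound(4) / 2, rng=rng):
        print(pos, seq)
```

Under this reading of the bound, a DNA alphabet with four letters tolerates a per-character mutation probability below 0.09375, and a binary alphabet below 0.0625.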

Bibliographic details

Source: ACM Transactions on Algorithms | 2011, Issue 2 | 18 pages
Authors: Fu B.; Kao M.-Y.; Wang L.
Original format: PDF
Language: English
CLC classification: Mathematics
Keywords: Complexity; Motif detection; Probabilistic analysis; Probability

Similar literature

1. Discovering almost any hidden motif from multiple sequences [J]. Fu B., Kao M.-Y., Wang L. ACM Transactions on Algorithms. 2011, Issue 2
2. ARCS-Motif: discovering correlated motifs from unaligned biological sequences [J]. Shijie Zhang, Wei Su, and Jiong Yang. Bioinformatics. 2009, Issue 2
3. Discovering short linear protein motif based on selective training of profile hidden Markov models [J]. Song Tao, Gu Hong. Journal of Theoretical Biology. 2015
4. Discovering Almost Any Hidden Motif from Multiple Sequences in Polynomial Time with Low Sample Complexity and High Success Probability [C]. Bin Fu, Ming-Yang Kao, Lusheng Wang. Theory and Application of Models of Computation. 2009
5. Discovering motifs in DNA and protein sequences: The approximate common substring problem [D]. Bailey, Timothy Lawrence. 1995
6. Using hidden Markov models to investigate G-quadruplex motifs in genomic sequences [O]. Masato Yano, Yuki Kato. 2014
7. Figure 4: (A) One conserved sequence, which occurs 79 times in 46,264 binding site peaks from the ChIP-seq dataset. The mutation profile of this conserved sequence is illustrated, where '_' indicates that the base is unchanged; DEL indicates that the base is lost; INS X indicates that a new base X is inserted in front of the base. (B) Several repeated-element patterns are listed. (C) The first column shows the top five DNA motifs mined by the MEME-ChIP tools (Machanick & Bailey, 2011). The second column lists the resembling conserved sequences found by the CFSP algorithm. The third column lists the position-specific scoring matrices (PSSMs) derived from the mutational information. The similarity between each MEME motif and the resembling conserved sequence in PSSM format was calculated with the STAMP motif comparison tool (Mahony & Benos, 2007). The E-values for the similarity of those pairs are displayed in the fourth column. (D) One motif is selected from each group clustered by gkmSVM descriptors, and the corresponding motif found by the CFSP algorithm is listed below it. (E) Additional datasets (File No: ENCFF100GRL, ENCFF616IRT, ENCFF870CER; Target: SREBF1) were collected from https://www.encodeproject.org. The top two motifs in each file are selected using the MEME tools, and the corresponding motifs found by our algorithm are listed below them. [O].
