Discovering Motifs in DNA Sequences: A Candidate Motifs Based Approach

Abstract

Motif finding is a classical combinatorial problem in bioinformatics. Motifs are small sets of immunity genes that are present in DNA sequences as binding sites and are turned on whenever the organism is infected. Identifying these motifs for transcription factors is therefore of great biological importance. The field has grown significantly in recent years, and many algorithms have been proposed to solve this problem. However, high computational complexity remains its most challenging aspect and continues to attract the attention of many researchers. This paper presents an efficient algorithm that extracts transcription factor binding sites from a set of DNA sequences using some operations on the sequences. The motif we work on is of known length, ungapped, and non-mutated. The proposed algorithm performs some preprocessing and builds adjacency lists for finding such sites. Although any two randomly selected sequences can be used for preprocessing, we use the first two sequences as the base for constructing the adjacency lists, which are later used for fast detection of the l-mers common to both. These l-mers are treated as candidate motifs, and each is then checked for its presence in all of the remaining DNA sequences using a sliding window approach. The proposed algorithm, CMMF, is experimentally validated on millions of DNA sequences. Additionally, the formulation of the motif finding algorithm is applicable to related problems in data mining, pattern detection, and similar fields.
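
The candidate-motif idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' CMMF implementation: the abstract does not specify how the adjacency lists are built, so a plain dictionary from each l-mer to its start positions stands in for them here. The function names (`lmer_index`, `occurs_in`, `find_exact_motifs`) and the sample sequences are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def lmer_index(seq, l):
    """Map each l-mer in seq to the list of start positions where it occurs.

    Assumption: this dictionary stands in for the paper's adjacency lists,
    whose exact construction is not given in the abstract.
    """
    index = defaultdict(list)
    for i in range(len(seq) - l + 1):
        index[seq[i:i + l]].append(i)
    return index


def occurs_in(seq, motif):
    """Sliding-window check: does motif appear exactly anywhere in seq?"""
    l = len(motif)
    return any(seq[i:i + l] == motif for i in range(len(seq) - l + 1))


def find_exact_motifs(sequences, l):
    """Return the sorted l-mers that occur exactly in every sequence."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    first, second, *rest = sequences
    index = lmer_index(first, l)
    # Candidate motifs: l-mers of the second sequence also found in the first.
    candidates = {second[i:i + l]
                  for i in range(len(second) - l + 1)
                  if second[i:i + l] in index}
    # Keep only candidates present in all remaining sequences.
    return sorted(c for c in candidates
                  if all(occurs_in(s, c) for s in rest))


seqs = ["ACGTTGCA", "TTGCAACG", "GGTTGCAT"]  # made-up example data
motifs = find_exact_motifs(seqs, 4)
print(motifs)  # ['TGCA', 'TTGC']
```

Restricting the candidates to l-mers shared by the first two sequences means the sliding-window pass over the remaining sequences only has to test that small set, which is the saving the abstract attributes to the preprocessing step.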

Bibliographic details

Source: International Conference on Parallel, Distributed and Grid Computing, 2018, pp. 599-604 (6 pages)
Conference location: Solan, Himachal Pradesh (IN)
Authors: Abhinav Jain; Rajat Parashar; Ashish Kumar Goyal; Prantik Biswas; Suma Dawn; Aparajita Nanda
Affiliation: Dept. of CS & IT, Jaypee Institute of Information Technology, Noida, India
Original format: PDF
Keywords: DNA; Information technology; Proteins; Indexes; Data structures; Microsoft Windows; Conferences

Similar literature

1. ARCS-Motif: discovering correlated motifs from unaligned biological sequences [J]. Zhang S, Su W, Yang J. Bioinformatics, 2009, no. 2.
2. ARCS-Motif: discovering correlated motifs from unaligned biological sequences [J]. Shijie Zhang, Wei Su, and Jiong Yang. Bioinformatics, 2009, no. 2.
3. Parallelizing and optimizing a hybrid differential evolution with Pareto tournaments for discovering motifs in DNA sequences [J]. David L. Gonzalez-Alvarez, Miguel A. Vega-Rodriguez, Alvaro Rubio-Largo. Journal of Supercomputing, 2014, no. 2.
4. Discovering Motifs in DNA Sequences: A Candidate Motifs Based Approach [C]. Abhinav Jain, Rajat Parashar, Ashish Kumar Goyal. International Conference on Parallel, Distributed and Grid Computing, 2018.
5. Discovering motifs in DNA and protein sequences: The approximate common substring problem [D]. Bailey, Timothy Lawrence. 1995.
6. DLocalMotif: a discriminative approach for discovering local motifs in protein sequences [O]. Ahmed M. Mehdi, Muhammad Shoaib B. Sehgal, Bostjan Kobe.
7. Figure 4: (A) One conserved sequence, which occurs 79 times in 46,264 binding site peaks from the ChIP-seq data set. The mutation profile of this conserved sequence is illustrated, where '_' indicates that the base is unchanged, DEL indicates that the base is lost, and INS X indicates that a new base X is inserted in front of the base. (B) Several repeated element patterns are listed. (C) The first column shows the top five DNA motifs mined by the meme-chip tools (Machanick & Bailey, 2011). The resemblant conserved sequences found by the CFSP algorithm are listed in the second column. The third column lists the position-specific scoring matrices transformed from the mutational information. The similarity between each meme motif and the resemblant conserved sequence in PSSM format was calculated with the stamp motif comparison tool (Mahony & Benos, 2007). The E-values for the similarity of those pairs are displayed in the fourth column. (D) One motif is selected from each group clustered by gkmsvm descriptors, and the corresponding motif found by the CFSP algorithm is listed below it. (E) Additional datasets (File No: ENCFF100GRL, ENCFF616IRT, ENCFF870CER; Target: SREBF1) were collected from https://www.encodeproject.org. The top two motifs are selected in each file using the meme tools, and the corresponding motifs found by our algorithm are listed below them. [O].
8. Critical DNA flanking sequences of a CpG oligodeoxynucleotide, but not the 6 base CpG motif, can be replaced with RNA without quantitative or qualitative changes in Toll-like receptor 9-mediated activity [R]. Sen, G., Flora, M., Chattopadhyay, G. 2004.
