
GADEM: A Genetic Algorithm Guided Formation of Spaced Dyads Coupled with an EM Algorithm for Motif Discovery



Abstract

Genome-wide analyses of protein binding sites generate large amounts of data; a ChIP dataset might contain 10,000 sites. Unbiased motif discovery in such datasets is not generally feasible using current methods that employ probabilistic models. We propose an efficient method, GADEM, which combines spaced dyads and an expectation-maximization (EM) algorithm. Candidate words (four to six nucleotides) for constructing spaced dyads are prioritized by their degree of overrepresentation in the input sequence data. Spaced dyads are converted into starting position weight matrices (PWMs). GADEM then employs a genetic algorithm (GA), with an embedded EM algorithm to improve starting PWMs, to guide the evolution of a population of spaced dyads toward one whose entropy scores are more statistically significant. Spaced dyads whose entropy scores reach a pre-specified significance threshold are declared motifs. GADEM performed comparably with MEME on 500 sets of simulated “ChIP” sequences with embedded known P53 binding sites. The major advantage of GADEM over competing methods is its computational efficiency on large ChIP datasets. We applied GADEM to six genome-wide ChIP datasets. Approximately 15 to 30 motifs of various lengths were identified in each dataset. Remarkably, without any prior motif information, the expected known motif (e.g., P53 in P53 data) was identified every time. GADEM discovered motifs of various lengths (6–40 bp) and characteristics in these datasets, which contained from 0.5 to >13 million nucleotides, with run times of 5 to 96 h. GADEM can be viewed as an extension of the well-known MEME algorithm and is an efficient tool for de novo motif discovery in large-scale genome-wide data. The GADEM software is available at www.niehs.nih.gov/research/resources/software/GADEM/.
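The step in which a spaced dyad becomes a starting PWM can be illustrated with a minimal Python sketch. It is not the GADEM implementation. The function names `dyad_to_pwm` and `information_content`, the 0.7 weight on each word base, the uniform spacer columns, and the uniform background are all assumptions made for illustration. The score shown is the standard relative-entropy (information content) measure, which may differ from the entropy score and significance test used in the paper.

```python
import math

BASES = "ACGT"


def dyad_to_pwm(word1, spacer_len, word2, peak=0.7):
    """Build a starting PWM from a spaced dyad (word1, spacer, word2).

    Assumption (not from the paper): each word position puts `peak` on the
    observed base and splits the remainder equally over the other three
    bases; each spacer position is uniform over A, C, G, T.
    """
    if spacer_len < 0:
        raise ValueError("spacer_len must be non-negative")
    rest = (1.0 - peak) / 3

    def word_columns(word):
        return [{b: (peak if b == c else rest) for b in BASES} for c in word]

    spacer = [{b: 0.25 for b in BASES} for _ in range(spacer_len)]
    return word_columns(word1) + spacer + word_columns(word2)


def information_content(pwm, background=None):
    """Sum over columns of the relative entropy (in bits) against a background."""
    bg = background or {b: 0.25 for b in BASES}
    total = 0.0
    for column in pwm:
        for b in BASES:
            p = column[b]
            if p > 0:
                total += p * math.log2(p / bg[b])
    return total


# Hypothetical dyad: two 5-nucleotide words separated by a 3-nucleotide spacer.
pwm = dyad_to_pwm("GGACA", 3, "TGCCC")
score = information_content(pwm)
print(len(pwm), "columns; information content:", round(score, 3), "bits")
```

In this sketch only the word columns contribute to the score, because a uniform spacer column has zero relative entropy against a uniform background. An EM refinement step, as described in the abstract, would then update all columns from the sequence data.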

Bibliographic details

  • Source
    Journal of Computational Biology | 2009, Issue 2 | pp. 317-329 | 13 pages
  • Author

    Leping Li;

  • Author affiliation

    Biostatistics Branch, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, North Carolina.;

  • Original format: PDF
  • Text language: eng