Journal: Bioinformatics

DISCOVER: a feature-based discriminative method for motif search in complex genomes



Abstract

Motivation: Identifying transcription factor binding sites (TFBSs) encoding complex regulatory signals in metazoan genomes remains a challenging problem in computational genomics. Due to degeneracy of nucleotide content among binding site instances or motifs, and intricate 'grammatical organization' of motifs within cis-regulatory modules (CRMs), extant pattern matching-based in silico motif search methods often suffer from impractically high false positive rates, especially in the context of analyzing large genomic datasets, and noisy position weight matrices which characterize binding sites. Here, we try to address this problem by using a framework to maximally utilize the information content of the genomic DNA in the region of query, taking cues from values of various biologically meaningful genetic and epigenetic factors in the query region such as clade-specific evolutionary parameters, presence/absence of nearby coding regions, etc. We present a new method for TFBS prediction in metazoan genomes that utilizes both the CRM architecture of sequences and a variety of features of individual motifs. Our proposed approach is based on a discriminative probabilistic model known as conditional random fields that explicitly optimizes the predictive probability of motif presence in large sequences, based on the joint effect of all such features.

Results: This model overcomes weaknesses in earlier methods based on less effective statistical formalisms that are sensitive to spurious signals in the data. We evaluate our method on both simulated CRMs and real Drosophila sequences in comparison with a wide spectrum of existing models, and outperform the state of the art by 22% in F1 score.
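To make the idea of a conditional random field scoring motif presence from several features at once more concrete, here is a minimal toy sketch of a linear-chain CRF over two labels (background and motif site). It is not the DISCOVER implementation: the feature names `pwm` and `cons`, the weights, and the transition scores are all hypothetical values chosen for illustration, and the chain structure is an assumption rather than the architecture described in the paper. The sketch computes the conditional log-probability of a label sequence given per-position features, using the forward algorithm for the normalizer.

```python
import math

LABELS = (0, 1)  # 0 = background, 1 = motif site (toy label set, an assumption)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a small list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def node_score(label, feats, weights):
    """Per-position score: background scores 0, motif gets a weighted feature sum."""
    if label == 0:
        return 0.0
    return weights["bias"] + sum(weights[name] * value for name, value in feats.items())


def sequence_score(labels, features, weights, trans):
    """Unnormalized log-score of one complete label sequence."""
    score = node_score(labels[0], features[0], weights)
    for t in range(1, len(labels)):
        score += trans[labels[t - 1]][labels[t]]
        score += node_score(labels[t], features[t], weights)
    return score


def log_partition(features, weights, trans):
    """Forward algorithm: log of the summed exp-scores over all label sequences."""
    alpha = [node_score(y, features[0], weights) for y in LABELS]
    for t in range(1, len(features)):
        alpha = [
            logsumexp([alpha[p] + trans[p][y] for p in LABELS])
            + node_score(y, features[t], weights)
            for y in LABELS
        ]
    return logsumexp(alpha)


def conditional_log_prob(labels, features, weights, trans):
    """log P(labels | features) under the toy model."""
    return sequence_score(labels, features, weights, trans) - log_partition(
        features, weights, trans
    )


# Hypothetical inputs: a PWM log-odds score and a conservation score per position.
features = [
    {"pwm": -1.0, "cons": 0.2},
    {"pwm": 2.5, "cons": 0.9},
    {"pwm": 0.3, "cons": 0.4},
]
weights = {"bias": -1.0, "pwm": 1.0, "cons": 0.5}  # illustrative, not learned
trans = [[0.0, -0.5], [-0.5, 0.5]]  # trans[prev][cur], illustrative

log_p = conditional_log_prob([0, 1, 0], features, weights, trans)
print(f"log P(labels = [0, 1, 0] | features) = {log_p:.4f}")
```

In a discriminative setting the weights would be fit by maximizing this conditional log-probability over labeled training sequences; the values above are fixed by hand only to keep the sketch short.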
