Journal of Mathematical Biology

Coding sequence density estimation via topological pressure

Abstract

We give a new approach to coding sequence (CDS) density estimation in genomic analysis based on the topological pressure, which we develop from a well-known concept in ergodic theory. Topological pressure measures the 'weighted information content' of a finite word, and incorporates 64 parameters which can be interpreted as a choice of weight for each nucleotide triplet. We train the parameters so that the topological pressure fits the observed coding sequence density on the human genome, and use this to give ab initio predictions of CDS density over windows of size around 66,000 bp on the genomes of Mus musculus, rhesus macaque and Drosophila melanogaster. While the differences between these genomes are too great to expect that training on the human genome could predict, for example, the exact locations of genes, we demonstrate that our method gives reasonable estimates for the 'coarse scale' problem of predicting CDS density. Inspired again by ergodic theory, the weightings of the nucleotide triplets obtained from our training procedure are used to define a probability distribution on finite sequences, which can be used to distinguish between intron and exon sequences from the human genome of lengths between 750 and 5,000 bp. At the end of the paper, we explain the theoretical underpinning for our approach, which is the theory of Thermodynamic Formalism from the dynamical systems literature. Mathematica and MATLAB implementations of our method are available at http://sourceforge.net/projects/topologicalpres/.
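
As a rough illustration of what a 'weighted information content' with one weight per nucleotide triplet could look like, the Python sketch below scores a word by summing, over its distinct subwords of a chosen length m, the product of the weights of the triplets inside each subword, and then takes a base-4 logarithm divided by m. This formula is an assumption made for illustration only: the abstract does not state the definition, and the names `topological_pressure`, `weights` and `m` are hypothetical. It is not the authors' Mathematica or MATLAB implementation.

```python
import math
from itertools import product

# The 64 nucleotide triplets, one weight parameter each.
TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]


def topological_pressure(word, weights, m):
    """Illustrative weighted subword count on a log-4 scale.

    Assumed form (NOT taken from the abstract):
        P = (1/m) * log_4( sum over distinct length-m subwords u of word
                           of prod_i weights[u[i:i+3]] )
    `word` must contain only A, C, G, T; `weights` maps each of the
    64 triplets to a positive number.
    """
    if m < 3:
        raise ValueError("m must be at least 3 so each subword holds a triplet")
    if len(word) < m:
        raise ValueError("word is shorter than the subword length m")

    # Each distinct subword is counted once, however often it occurs.
    subwords = {word[i:i + m] for i in range(len(word) - m + 1)}

    total = 0.0
    for u in subwords:
        w = 1.0
        for i in range(m - 2):  # every overlapping triplet inside u
            w *= weights[u[i:i + 3]]
        total += w
    return math.log(total, 4) / m


# With all weights equal to 1 the score depends only on how many
# distinct length-m subwords the word contains.
uniform = {t: 1.0 for t in TRIPLETS}
score = topological_pressure("ACGTACGT", uniform, 3)
```

Under this assumed form, a subword length of m = 8 pairs naturally with a window of 4^8 + 8 - 1 = 65,543 bp, which is close to the window size quoted in the abstract; that correspondence is likewise an assumption rather than something the abstract states.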
