On the Representability of Complete Genomes by Multiple Competing Finite-Context (Markov) Models

Abstract

A finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been extensively used to describe the gene structure of prokaryotes and eukaryotes. However, to our knowledge, a comprehensive study of the potential of Markov models to describe complete genomes is still lacking. We address this gap in this paper. Our approach relies on (i) multiple competing Markov models of different orders, (ii) careful programming techniques that allow orders as large as sixteen, (iii) adequate handling of inverted repeats, and (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence, we use the negative logarithm of the probability estimate at that position. This measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from the probabilistic or information-theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well as, or even better than, state-of-the-art DNA compression methods, such as XM, which rely on very different statistical models. This is surprising, because Markov models are local (short-range), whereas the statistical models underlying other methods exploit the extensive data repetitions in DNA sequences and therefore have a non-local character.
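
The per-position measure and the bits-per-base average described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `information_profile`, the additive smoothing parameter `alpha`, and the default orders are assumptions made for illustration. The sketch also picks the best model at each position with knowledge of the actual symbol and does not account for the cost of signalling that choice, so its average is an optimistic lower bound rather than a real code length. Inverted repeats are not modeled.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def information_profile(seq, orders=(1, 2, 4), alpha=1.0):
    """Per-base cost in bits under several competing finite-context models.

    Each model of order k estimates P(symbol | previous k symbols) from the
    counts seen so far, smoothed by a hypothetical additive parameter alpha.
    The recorded cost is the smallest -log2(P) among the models (an oracle
    choice: the cost of signalling the selected model is ignored).
    """
    # One table per order: context string -> counts of the following symbol.
    counts = {k: defaultdict(lambda: dict.fromkeys(ALPHABET, 0)) for k in orders}
    profile = []
    for i, sym in enumerate(seq):
        best = math.inf
        for k in orders:
            ctx = seq[max(0, i - k):i]  # shorter context near the start
            table = counts[k][ctx]
            total = sum(table.values())
            p = (table[sym] + alpha) / (total + alpha * len(ALPHABET))
            best = min(best, -math.log2(p))
            table[sym] += 1  # update the model after it has been evaluated
        profile.append(best)
    return profile

if __name__ == "__main__":
    seq = "ACGT" * 50
    profile = information_profile(seq)
    # Global performance measure: average number of bits per base.
    print(sum(profile) / len(profile))
```

A highly repetitive input such as the one above costs well under 2 bits per base on average, while the very first base costs exactly 2 bits because every model starts from a uniform estimate over the four symbols.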
