Journal of Computational Biology

The Distribution of Word Matches Between Markovian Sequences with Periodic Boundary Conditions

Abstract

Word match counts have traditionally been proposed as an alignment-free measure of similarity for biological sequences. The D2 statistic, which simply counts the number of exact word matches between two sequences, is a useful test bed for developing rigorous mathematical results, which can then be extended to more biologically useful measures. The distributional properties of the D2 statistic under the null hypothesis of independent and identically distributed letters have been studied extensively, but no comprehensive study exists of the D2 distribution for biologically more realistic higher-order Markovian sequences. Here we derive exact formulas for the mean and variance of the D2 statistic for Markovian sequences of any order, and demonstrate through Monte Carlo simulations that the entire distribution is accurately characterized by a Pólya-Aeppli distribution for sequence lengths of biological interest. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, which enables exact analytic formulas for the mean and variance to be derived. We also carry out a preliminary comparison between the approximate D2 distribution computed with the theoretical mean and variance under a Markovian hypothesis and an empirical D2 distribution from the human genome.
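
As an illustration of the statistic described in the abstract, the following is a minimal sketch of counting exact word matches between two sequences. It assumes that "periodic boundary conditions" means each sequence is treated as circular, so that words may wrap around the end; the function names `word_counts` and `d2` and the `periodic` flag are hypothetical and are not taken from the paper.

```python
from collections import Counter


def word_counts(seq, k, periodic=True):
    """Count every length-k word in seq.

    Assumption: with periodic=True the sequence is treated as circular,
    so a sequence of length n contributes exactly n words.
    """
    n = len(seq)
    if not 1 <= k <= n:
        raise ValueError("word length k must satisfy 1 <= k <= len(seq)")
    if periodic:
        # Append the first k-1 letters so that words can wrap around.
        extended = seq + seq[:k - 1]
        n_starts = n
    else:
        extended = seq
        n_starts = n - k + 1
    return Counter(extended[i:i + k] for i in range(n_starts))


def d2(seq_a, seq_b, k, periodic=True):
    """Number of exact k-word matches over all pairs of positions.

    Equivalent to summing, over every word w, the product of the
    number of occurrences of w in seq_a and in seq_b.
    """
    counts_a = word_counts(seq_a, k, periodic)
    counts_b = word_counts(seq_b, k, periodic)
    return sum(c * counts_b[w] for w, c in counts_a.items())


if __name__ == "__main__":
    print(d2("ACGT", "ACGT", 2, periodic=True))
    print(d2("ACGT", "ACGT", 2, periodic=False))
```

Counting words first and then multiplying the counts avoids an explicit loop over all pairs of positions, which matters for sequences of realistic length.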