Physical Review E: Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics

Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics


Abstract

We compare the statistical properties of coding and noncoding regions in eukaryotic and viral DNA sequences by adapting two tests developed for the analysis of natural languages and symbolic sequences. The data set comprises all 30 sequences of length above 50 000 base pairs in GenBank Release No. 81.0, as well as the recently published sequences of C. elegans chromosome III (2.2 Mbp) and yeast chromosome XI (661 kbp). We find that, for the three chromosomes we studied, the statistical properties of noncoding regions appear to be closer to those observed in natural languages than those of the coding regions. In particular, (i) an n-tuple Zipf analysis of noncoding regions reveals a regime close to power-law behavior, whereas the coding regions show logarithmic behavior over a wide interval, and (ii) an n-gram entropy measurement shows that the noncoding regions have a lower n-gram entropy (and hence a larger "n-gram redundancy") than the coding regions. In contrast to the three chromosomes, we find that for vertebrates, such as primates and rodents, and for viral DNA, the difference between the statistical properties of coding and noncoding regions is not pronounced, and therefore the results of the analyses of the investigated sequences are less conclusive. After noting the intrinsic limitations of the n-gram redundancy analysis, we also briefly discuss the failure of zero- and first-order Markovian models or simple nucleotide repeats to account fully for these "linguistic" features of DNA. Finally, we emphasize that our results by no means prove the existence of a "language" in noncoding DNA.
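The two tests named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made for illustration only: n-tuples are read from overlapping windows shifted one base at a time, entropy is measured in bits, and redundancy is taken as 1 - H(n) / (n log2 4), a common Shannon-style definition that may differ from the one used in the paper. All function names are hypothetical.

```python
from collections import Counter
from math import log2


def ntuple_counts(seq, n):
    """Count overlapping n-tuples (assumption: window shifted by one base)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def zipf_ranking(seq, n):
    """Return (rank, n-tuple, count) triples, most frequent n-tuple first."""
    counts = ntuple_counts(seq, n)
    # Ties are broken alphabetically so the ordering is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, c) for rank, (word, c) in enumerate(ordered, start=1)]


def ngram_entropy(seq, n):
    """Shannon entropy (in bits) of the empirical n-tuple distribution."""
    counts = ntuple_counts(seq, n)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def ngram_redundancy(seq, n, alphabet_size=4):
    """Assumed definition: 1 - H(n) / (n * log2(alphabet_size))."""
    return 1.0 - ngram_entropy(seq, n) / (n * log2(alphabet_size))
```

In a Zipf analysis, the counts from `zipf_ranking` would be plotted against rank on log-log axes, where a regime close to power-law behavior appears as an approximately straight line. For short sequences and large n the entropy estimate is biased by undersampling, which is one reason such a redundancy measure has intrinsic limitations.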
