Journal: Physical Biology

The bulk and the tail of minimal absent words in genome sequences

Abstract

Minimal absent words (MAWs) of a genomic sequence are words that do not occur in the sequence themselves but whose proper subwords all do. The distribution of genomic MAWs as a function of their length has been observed to be qualitatively similar for all living organisms: the bulk are rather short, and only relatively few are long. It has been an open question whether this phenomenon is statistical in origin or reflects a biological mechanism, and what biological information absent words contain. In this work we demonstrate that the bulk can be described by a probabilistic model of sampling words from random sequences, while the tail of long MAWs is of biological origin. We introduce the concept of the core of a MAW: the sequences that are present in the genome and closest to a given MAW. We show that in E. faecalis, E. coli and yeast the cores of the longest MAWs, which exist in two or more copies, are located in highly conserved regions, the most prominent example being ribosomal RNAs. We also show that while the distribution of the cores of long MAWs is roughly uniform over these genomes at a coarse-grained level, at a more detailed level it is strongly enhanced in 3' untranslated regions (UTRs) and, to a lesser extent, in 5' UTRs. This indicates that MAWs and their associated cores correspond to fine-tuned evolutionary relationships, and suggests that they can be used more widely as markers of genomic complexity.
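To make the definition of a MAW concrete, the following is a minimal brute-force sketch in Python. It is not the method used in the paper. The function name `minimal_absent_words` and the parameters `max_len` and `alphabet` are hypothetical and chosen only for illustration. It relies on the standard observation that a word of length at least 2 is a MAW exactly when the word itself is absent while both its longest proper prefix and its longest proper suffix are present. Single letters missing from the sequence are not reported, and the approach is only practical for short sequences.

```python
def minimal_absent_words(seq, max_len, alphabet="ACGT"):
    """Return the minimal absent words of `seq` with length 2..max_len.

    Illustrative brute force only: it stores every factor of `seq`
    up to length `max_len`, so it does not scale to whole genomes.
    """
    # All words of length 1..max_len that occur in seq.
    factors = {
        seq[i:i + k]
        for k in range(1, max_len + 1)
        for i in range(len(seq) - k + 1)
    }

    maws = []
    for k in range(2, max_len + 1):
        # A MAW of length k must end in a present word of length k-1,
        # so extend each present (k-1)-word by one letter on the left.
        for suffix in sorted(f for f in factors if len(f) == k - 1):
            for letter in alphabet:
                word = letter + suffix
                # Absent itself, but its longest proper prefix is present
                # (its longest proper suffix is present by construction).
                if word not in factors and word[:-1] in factors:
                    maws.append(word)
    return maws


if __name__ == "__main__":
    # Toy sequence, not real genomic data.
    print(minimal_absent_words("ACAC", 4))
```

For the toy sequence `ACAC`, the words `AA` and `CC` are absent while `A` and `C` are present, and `CACA` is absent while `CAC` and `ACA` are present, so these three words are reported. Counting the returned words by length gives the kind of length distribution the abstract refers to.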
