Conference: IEEE International Symposium on Information Theory

Capacity and expressiveness of genomic tandem duplication

Abstract

The majority of the human genome consists of repeated sequences. An important type of repeat common in the human genome is the tandem repeat, in which identical copies appear next to each other. For example, in the sequence AGTCTGTGC, TGTG is a tandem repeat: it can be generated from AGTCTGC by a tandem duplication of length 2. In this work, we investigate the possibility of generating a large number of sequences from a small initial string (called the seed) by tandem duplications of bounded length. Our results include exact capacity values for certain tandem duplication string systems with alphabet sizes 2, 3, and 4. In addition, motivated by the role of DNA sequences in expressing proteins via RNA and the genetic code, we define the expressiveness of a tandem duplication system as the feasibility of expressing arbitrary substrings. We then completely characterize the expressiveness of tandem duplication systems for general alphabet sizes and duplication lengths. Noticing that a system with capacity equal to 1 is expressive, we prove that for alphabet size ≥ 4 the capacity is strictly smaller than 1, independent of the seed and the duplication lengths. The proof of this limit on the capacity (note that the genomic alphabet size is 4) is related to an interesting 1906 result by Axel Thue, which states that there exist arbitrarily long sequences with no tandem repeats (square-free sequences) for alphabet size ≥ 3. Finally, our results illustrate that duplication lengths play a more significant role than the seed in generating a large number of sequences for these systems.
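The tandem duplication operation described in the abstract can be illustrated with a small Python sketch. This is not the authors' method and does not compute capacity. It is a brute-force illustration, and the function names (`tandem_duplicate`, `descendants`, `is_square_free`) are hypothetical names chosen here. It reproduces the AGTCTGC example, enumerates the strings reachable from a seed under a bound on the duplication length, and checks the square-free property that Thue's result concerns.

```python
from typing import Set


def tandem_duplicate(s: str, start: int, length: int) -> str:
    """Return s with the block s[start:start+length] copied right after itself."""
    if length <= 0 or start < 0 or start + length > len(s):
        raise ValueError("duplicated block must lie inside the string")
    block = s[start:start + length]
    return s[:start + length] + block + s[start + length:]


def descendants(seed: str, max_dup_len: int, max_len: int) -> Set[str]:
    """Strings of length <= max_len reachable from seed by tandem
    duplications of length <= max_dup_len (brute force; illustration only)."""
    seen = {seed}
    frontier = [seed]
    while frontier:
        s = frontier.pop()
        for k in range(1, max_dup_len + 1):
            if len(s) + k > max_len:
                break  # longer duplications would also exceed max_len
            for i in range(len(s) - k + 1):
                t = tandem_duplicate(s, i, k)
                if t not in seen:
                    seen.add(t)
                    frontier.append(t)
    return seen


def is_square_free(s: str) -> bool:
    """True if s contains no tandem repeat xx with x non-empty."""
    n = len(s)
    for k in range(1, n // 2 + 1):
        for i in range(n - 2 * k + 1):
            if s[i:i + k] == s[i + k:i + 2 * k]:
                return False
    return True


# The example from the abstract: duplicating "TG" (length 2) in AGTCTGC.
example = tandem_duplicate("AGTCTGC", 4, 2)
reachable = descendants("AGTCTGC", max_dup_len=2, max_len=9)
```

Here `example` is the sequence AGTCTGTGC from the abstract, and it also appears in `reachable`. The enumeration grows quickly with `max_len`, so it is only suitable for very short strings.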
