Journal: Bioinformatics

Efficient computation of all perfect repeats in genomic sequences of up to half a gigabyte, with a case study on the human genome

Abstract

Motivation: There is significant ongoing research to identify the number and types of repetitive DNA sequences. As more genomes are sequenced, efficiency and scalability in computational tools become mandatory. Existing tools fail to find distant repeats because they cannot accommodate whole chromosomes, only segments. Also, a quantitative framework for repetitive elements inside a genome or across genomes is still missing.

Results: We present a new efficient algorithm and its implementation as a software tool to compute all perfect repeats in inputs of up to 500 million nucleotide bases, possibly containing many genomes. Our algorithm is based on a suffix array construction and a novel procedure to extract all perfect repeats in the entire input, which can be arbitrarily distant, with no bound on the repeat length. We tested the software on the Homo sapiens DNA genome NCBI 36.49. We computed all perfect repeats of at least 40 bases occurring in any two chromosomes with exact matching. We found that each H. sapiens chromosome shares approximately 10% of its full sequence with every other human chromosome, distributed more or less evenly among the chromosome surfaces. We give statistics including a quantification of repeats by diversity, length and number of occurrences. We compared the computed repeats against all biological repeats currently obtainable from Ensembl enlarged with the output of the dust program and all elements identified by TRF and RepeatMasker (ftp://ftp.ebi.ac.uk/pub/databases/ensembl/jherrero/.repeats/all_repeats.txt.bz2). We report novel repeats as well as new occurrences of repeats matching with known biological elements.
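The abstract names a suffix array as the basis of the method but does not spell out the extraction procedure. The following is only a minimal sketch of the general idea, not the authors' algorithm: it builds a suffix array naively, computes the longest-common-prefix (LCP) array with Kasai's method, and groups neighbouring suffixes that share at least `min_len` characters. The function names (`suffix_array`, `lcp_array`, `perfect_repeats`) and the toy input are assumptions made for illustration, and the naive construction would not scale to the 500-million-base inputs described above.

```python
from typing import List, Tuple


def suffix_array(text: str) -> List[int]:
    # Naive construction by sorting suffixes; suitable for toy inputs only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text: str, sa: List[int]) -> List[int]:
    # Kasai's method: lcp[k] is the length of the longest common prefix
    # of the suffixes starting at sa[k - 1] and sa[k]; lcp[0] is 0.
    n = len(text)
    rank = [0] * n
    for k, start in enumerate(sa):
        rank[start] = k
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1
    return lcp


def perfect_repeats(text: str, min_len: int) -> List[Tuple[str, List[int]]]:
    # Each maximal run of LCP values >= min_len corresponds to a group of
    # suffixes sharing an exact prefix; report that prefix and its positions.
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    n = len(text)
    found = []
    k = 1
    while k < n:
        if lcp[k] < min_len:
            k += 1
            continue
        first = k - 1
        shared = lcp[k]
        while k < n and lcp[k] >= min_len:
            shared = min(shared, lcp[k])
            k += 1
        positions = sorted(sa[first:k])
        found.append((text[positions[0]:positions[0] + shared], positions))
    return found


if __name__ == "__main__":
    for repeat, positions in perfect_repeats("ACGTACGTTT", 4):
        print(repeat, positions)
```

With a threshold of 4, the toy sequence yields the single repeat `ACGT` at positions 0 and 4. Lowering the threshold to 3 also reports `CGT`, which lies inside `ACGT`; this sketch does not filter such nested repeats, and a real tool would need a rule for them as well as a linear-time, memory-efficient suffix array construction.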
