Efficient computation of all perfect repeats in genomic sequences of up to half a gigabyte, with a case study on the human genome

Becher Veronica; Deymonnaz Alejandro; Heiber Pablo


Abstract

Motivation: There is significant ongoing research to identify the number and types of repetitive DNA sequences. As more genomes are sequenced, efficiency and scalability in computational tools become mandatory. Existing tools fail to find distant repeats because they cannot accommodate whole chromosomes, only segments. Also, a quantitative framework for repetitive elements inside a genome or across genomes is still missing.

Results: We present a new efficient algorithm and its implementation as a software tool to compute all perfect repeats in inputs of up to 500 million nucleotide bases, possibly containing many genomes. Our algorithm is based on a suffix array construction and a novel procedure to extract all perfect repeats in the entire input, which can be arbitrarily distant, with no bound on the repeat length. We tested the software on the Homo sapiens DNA genome NCBI 36.49. We computed all perfect repeats of at least 40 bases occurring in any two chromosomes with exact matching. We found that each H. sapiens chromosome shares approximately 10% of its full sequence with every other human chromosome, distributed more or less evenly among the chromosome surfaces. We give statistics including a quantification of repeats by diversity, length and number of occurrences. We compared the computed repeats against all biological repeats currently obtainable from Ensembl enlarged with the output of the dust program and all elements identified by TRF and RepeatMasker (ftp://ftp.ebi.ac.uk/pub/databases/ensembl/jherrero/.repeats/all_repeats.txt.bz2). We report novel repeats as well as new occurrences of repeats matching known biological elements.
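The abstract names a suffix array as the basis of the algorithm but does not describe the extraction procedure itself. The following is only a minimal sketch of the general idea, not the authors' method: it builds a suffix array naively, computes the longest-common-prefix (LCP) array with Kasai's algorithm, and reports each run of adjacent suffixes that share a prefix of at least `min_len` characters. The function names (`suffix_array`, `lcp_array`, `repeats_at_least`) and the toy input are assumptions made for illustration. The naive sort is far too slow and memory-hungry for inputs of hundreds of millions of bases.

```python
def suffix_array(text):
    # Naive construction (illustrative only): sort suffix start
    # positions by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # Kasai's algorithm: lcp[k] is the length of the common prefix of
    # the suffixes starting at sa[k - 1] and sa[k]; lcp[0] is 0.
    n = len(text)
    rank = [0] * n
    for k, pos in enumerate(sa):
        rank[pos] = k
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def repeats_at_least(text, min_len):
    # For every maximal run of adjacent suffixes whose pairwise LCP is
    # at least min_len, report the prefix shared by the whole run and
    # the sorted start positions where it occurs.
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    n = len(text)
    repeats = {}
    k = 1
    while k < n:
        if lcp[k] >= min_len:
            start = k - 1
            shared = lcp[k]
            while k < n and lcp[k] >= min_len:
                shared = min(shared, lcp[k])
                k += 1
            positions = sorted(sa[start:k])
            first = positions[0]
            repeats[text[first:first + shared]] = positions
        else:
            k += 1
    return repeats


if __name__ == "__main__":
    # Toy sequence; the paper's case study uses a threshold of 40 bases.
    print(repeats_at_least("ACGTACGTTTACGT", 4))
```

On the toy sequence with a threshold of 4, the sketch reports `ACGT` at positions 0, 4 and 10, and `TACGT` at positions 3 and 9. It does not attempt to enumerate every maximal repeat or to separate repeats within one chromosome from those shared between chromosomes, both of which the paper's tool addresses.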


Bibliographic details

Source: Bioinformatics, 2009, issue 14, 8 pages
Authors: Becher Veronica; Deymonnaz Alejandro; Heiber Pablo
Original format: PDF
Language: English
Chinese Library Classification: Bioengineering (biotechnology)
Keywords: genomic sequence; repetitive DNA sequence; gigabyte

Similar literature

1. Efficient computation of all perfect repeats in genomic sequences of up to half a gigabyte, with a case study on the human genome [J]. Becher Veronica, Deymonnaz Alejandro, Heiber Pablo. Bioinformatics. 2009, issue 14
2. A study on genomic distribution and sequence features of human long inverted repeats reveals species-specific intronic inverted repeats [J]. Wang Y, Leung FC. The FEBS journal. 2009, issue 7
3. Prevaccination genomic diversity of human papillomavirus genotype 11: a study on 63 clinical isolates and 10 full-length genome sequences. [J]. Maver PJ, Kocjan BJ, Seme K, Journal of Medical Virology. 2011, issue 3
4. Space-Efficient Computation of Maximal and Supermaximal Repeats in Genome Sequences [C]. Timo Beller, Katharina Berger, Enno Ohlebusch. International symposium on string processing and information retrieval. 2012
5. A computational genomics study: Characterizing genomic variants in non-coding regions of the human genome. [D]. Mu, Xinmeng. 2012
6. ChloroMitoSSRDB: Open Source Repository of Perfect and Imperfect Repeats in Organelle Genomes for Evolutionary Genomics [O]. Gaurav Sablok, Suresh B. Mudunuri, Sujan Patnana, 2013
7. A study on genomic distribution and sequence features of human long inverted repeats reveals species-specific intronic inverted repeats [O]. Wang, Y, Leung, FCC. 2009
8. Identification of Novel Inverted Terminal Repeat (ITR) Deletions of Human Adenovirus (AD) From Infected Host: Virulent Ads Containing Mixed Populations of Genomic Sequences; Conference paper [R]. Houng, H. H., Binn, L., Kuschner, R., 2006
