Identification of repeat structure in large genomes using repeat probability clouds


Abstract

The identification of repeat structure in eukaryotic genomes can be time-consuming and difficult because of the large amount of information (~3 × 10⁹ bp) that needs to be processed and compared. We introduce a new approach based on exact word counts to evaluate, de novo, the repeat structure present within large eukaryotic genomes. This approach avoids sequence alignment and similarity search, two of the most time-consuming components of traditional methods for repeat identification. Algorithms were implemented to efficiently calculate exact counts for oligonucleotides of any length in large genomes. Based on these oligonucleotide counts, oligonucleotide excess probability clouds, or “P-clouds”, were constructed. P-clouds are composed of clusters of related oligonucleotides that occur, as a group, more often than expected by chance. After construction, P-clouds were mapped back onto the genome, and regions of high P-cloud density were identified as repetitive regions using a sliding window approach. This efficient method is capable of analyzing the repeat content of the entire human genome on a single desktop computer in less than half a day, at least 10-fold faster than current approaches. The predicted repetitive regions strongly overlap with known repeat elements, as well as other repetitive regions such as gene families, pseudogenes and segmental duplicons. This method should be extremely useful for de novo identification of repeat structure in large newly sequenced genomes.
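
The three stages named in the abstract (exact word counts, clustering of related oligonucleotides, and sliding-window mapping) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the greedy one-substitution clustering rule, and the cutoff parameters (`core_cutoff`, `outer_cutoff`, `window`, `min_hits`) are all assumptions made for illustration, and a plain in-memory `Counter` stands in for whatever data structures a genome-scale tool would need.

```python
from collections import Counter


def count_kmers(seq, k):
    """Exact counts of every length-k word in seq (forward strand only)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def neighbors(word):
    """Yield all words that differ from `word` by exactly one substitution."""
    for i, base in enumerate(word):
        for alt in "ACGT":
            if alt != base:
                yield word[:i] + alt + word[i + 1:]


def build_clouds(counts, core_cutoff, outer_cutoff):
    """Hypothetical simplified clustering (an assumption, not the paper's rule).

    Each unassigned word with count >= core_cutoff seeds a new cloud, taken
    in order of decreasing count; one-substitution neighbors with
    count >= outer_cutoff join that cloud. Returns {word: cloud_id}.
    """
    assigned = {}
    cloud_id = 0
    for word, n in counts.most_common():
        if n < core_cutoff:
            break  # remaining words are too rare to seed a cloud
        if word in assigned:
            continue
        assigned[word] = cloud_id
        for nb in neighbors(word):
            if nb not in assigned and counts.get(nb, 0) >= outer_cutoff:
                assigned[nb] = cloud_id
        cloud_id += 1
    return assigned


def repetitive_regions(seq, k, cloud_words, window, min_hits):
    """Return merged (start, end) intervals, end exclusive, where at least
    `min_hits` of `window` consecutive k-mer start positions are cloud words."""
    hits = [1 if seq[i:i + k] in cloud_words else 0
            for i in range(len(seq) - k + 1)]
    regions = []
    if len(hits) < window:
        return regions
    running = sum(hits[:window])
    for start in range(len(hits) - window + 1):
        if start > 0:
            # Slide the window one position: add the new hit, drop the old one.
            running += hits[start + window - 1] - hits[start - 1]
        if running >= min_hits:
            end = start + window + k - 1  # last k-mer in the window ends here
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], max(regions[-1][1], end))
            else:
                regions.append((start, end))
    return regions
```

A call sequence would be `counts = count_kmers(seq, k)`, then `clouds = build_clouds(counts, core_cutoff, outer_cutoff)`, then `repetitive_regions(seq, k, clouds, window, min_hits)`. Because every step is a dictionary lookup or a running sum, no alignment or similarity search is involved, which is the property the abstract credits for the speed-up.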
