Source: Genomics Proteomics Bioinformatics (US National Institutes of Health literature)

A Statistical Approach Designed for Finding Mathematically Defined Repeats in Shotgun Data and Determining the Length Distribution of Clone-Inserts

Abstract

The large number of repeats, especially high-copy repeats, in the genomes of higher animals and plants makes whole genome assembly (WGA) difficult. To address this problem, we tried to identify repeats and mask them before assembly, even at the genome survey stage. Repeats of different copy numbers have different probabilities of appearing in shotgun data; based on this principle, we constructed a statistical model and derived criteria for mathematically defined repeats (MDRs) at different shotgun coverages. Using these criteria, we developed the software MDRmasker to identify and mask MDRs in shotgun data. Masking repeats before assembly increased assembly speed and lowered the error probability. In addition, clone-insert size affects the accuracy of repeat assembly and scaffold construction, so we also used our model to design the length distribution of clone-inserts. In our simulated human and rice genomes, the length distributions of repeats differ, and so do the optimal length distributions of clone-inserts. With an optimal length distribution of clone-inserts, a given genome can therefore be assembled better at lower coverage.
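The idea that copy number changes how often a sequence shows up in shotgun data can be illustrated with a small sketch. The abstract does not give the actual model behind MDRmasker, so everything below is an assumption for illustration: occurrences of a fixed word in the reads are taken to be Poisson distributed with mean equal to copy number times coverage, and the function names (`poisson_sf`, `repeat_threshold`, `detection_power`) and the significance level `alpha` are hypothetical.

```python
import math


def poisson_sf(t, lam):
    """Return P(X >= t) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(t))
    return max(0.0, 1.0 - cdf)


def repeat_threshold(coverage, alpha=0.01):
    """Smallest count t that a single-copy word reaches with probability < alpha.

    Assumption: a single-copy word is seen Poisson(coverage) times in the reads.
    """
    t = 0
    while poisson_sf(t, coverage) >= alpha:
        t += 1
    return t


def detection_power(copy_number, coverage, threshold):
    """Probability that a word with the given copy number reaches the threshold.

    Assumption: its count is Poisson(copy_number * coverage).
    """
    return poisson_sf(threshold, copy_number * coverage)


if __name__ == "__main__":
    # Thresholds and detection power at a few shotgun coverages.
    for coverage in (0.1, 0.5, 1.0, 2.0):
        t = repeat_threshold(coverage)
        powers = [round(detection_power(n, coverage, t), 3) for n in (2, 10, 100)]
        print(f"coverage={coverage}: threshold={t}, power for 2/10/100 copies={powers}")
```

Under these assumptions the count threshold rises with coverage, and high-copy repeats are detected with high probability even at survey-level coverage, while low-copy repeats are mostly missed. That is consistent with the abstract's point that the criteria depend on shotgun coverage.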
