...
首页> 外文期刊>Journal of computational biology: A journal of computational molecular cell biology >On Computing Breakpoint Distances for Genomes with Duplicate Genes
【24h】

On Computing Breakpoint Distances for Genomes with Duplicate Genes

机译:在计算具有重复基因的基因组的断点距离

获取原文
获取原文并翻译 | 示例

摘要

A fundamental problem in comparative genomics is to compute the distance between two genomes in terms of its higher level organization (given by genes or syntenic blocks). For two genomes without duplicate genes, we can easily define (and almost always efficiently compute) a variety of distance measures, but the problem is NP-hard under most models when genomes contain duplicate genes. To tackle duplicate genes, three formulations (exemplar, maximum matching, and any matching) have been proposed, all of which aim to build a matching between homologous genes so as to minimize some distance measure. Of the many distance measures, the breakpoint distance (the number of nonconserved adjacencies) was the first one to be studied and remains of significant interest because of its simplicity and model-free property. The three breakpoint distance problems corresponding to the three formulations have been widely studied. Although we provided last year a solution for the exemplar problem that runs very fast on full genomes, computing optimal solutions for the other two problems has remained challenging. In this article, we describe very fast, exact algorithms for these two problems. Our algorithms rely on a compact integer-linear program that we further simplify by developing an algorithm to remove variables, based on new results on the structure of adjacencies and matchings. Through extensive experiments using both simulations and biological data sets, we show that our algorithms run very fast (in seconds) on mammalian genomes and scale well beyond. We also apply these algorithms (as well as the classic orthology tool MSOAR) to create orthology assignment, then compare their quality in terms of both accuracy and coverage. We find that our algorithm for the any matching formulation significantly outperforms other methods in terms of accuracy while achieving nearly maximum coverage.
机译:比较基因组学中的一个基本问题是根据其更高水平的组织(由基因或同期嵌段给出)计算两个基因组之间的距离。对于没有重复基因的两个基因组,我们可以容易地定义(并且几乎总是有效地计算)各种距离测量,但是当基因组含有重复基因时,问题在大多数模型下的问题是NP - 硬。为了解决重复基因,已经提出了三种制剂(示例性,最大匹配和任何匹配),所有这些都旨在在同源基因之间构建匹配,以便最小化一些距离测量。在许多距离措施中,断点距离(非经常性邻接的数量)是第一个要研究的距离,并且由于其简单性和无模型物业,因此是第一个和重大兴趣的遗骸。已经广泛研究了与三种配方对应的三个断点距离问题。虽然我们在去年提供了一个解决方案的解决方案,但在全基因组上运行非常快,虽然在全基因组上运行非常快,但计算其他两个问题的最佳解决方案仍然具有挑战性。在本文中,我们描述了这两个问题的非常快速,精确的算法。我们的算法依赖于一个紧凑的整数线程,我们通过开发算法来删除变量的算法,基于新的结果对邻接和匹配的结构进行了新的结果。通过使用两种模拟和生物数据集进行广泛的实验,我们表明我们的算法在哺乳动物基因组上运行非常快(以秒为单位),并远远缩放。我们还将这些算法(以及经典的正轨工具MsoAL)应用于创建原子大学分配,然后在精度和覆盖范围内比较它们的质量。我们发现,我们的任何匹配配方的算法在准确性方面显着优于其他方法,同时实现了几乎最大的覆盖率。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号