首页> 外文期刊>Bioinformatics >Exact algorithms for haplotype assembly from whole-genome sequence data
【24h】

Exact algorithms for haplotype assembly from whole-genome sequence data

机译:从全基因组序列数据中进行单倍型装配的精确算法

获取原文
获取原文并翻译 | 示例
       

摘要

Motivation: Haplotypes play a crucial role in genetic analysis and have many applications such as gene disease diagnoses, association studies, ancestry inference and so forth. The development of DNA sequencing technologies makes it possible to obtain haplotypes from a set of aligned reads originated from both copies of a chromosome of a single individual. This approach is often known as haplotype assembly. Exact algorithms that can give optimal solutions to the haplotype assembly problem are highly demanded. Unfortunately, previous algorithms for this problem either fail to output optimal solutions or take too long time even executed on a PC cluster. Results: We develop an approach to finding optimal solutions for the haplotype assembly problem under the minimum-error-correction (MEC) model. Most of the previous approaches assume that the columns in the input matrix correspond to (putative) heterozygous sites. This all-heterozygous assumption is correct for most columns, but it may be incorrect for a small number of columns. In this article, we consider the MEC model with or without the all-heterozygous assumption. In our approach, we first use new methods to decompose the input read matrix into small independent blocks and then model the problem for each block as an integer linear programming problem, which is then solved by an integer linear programming solver. We have tested our program on a single PC [a Linux (x64) desktop PC with i7-3960X CPU], using the filtered HuRef and the NA 12878 datasets (after applying some variant calling methods). With the all-heterozygous assumption, our approach can optimally solve the whole HuRef data set within a total time of 31 h (26 h for the most difficult block of the 15th chromosome and only 5 h for the other blocks). To our knowledge, this is the first time that MEC optimal solutions are completely obtained for the filtered HuRef dataset. Moreover, in the general case (without the all-heterozygous assumption), for the HuRef dataset our approach can optimally solve all the chromosomes except the most difficult block in chromosome 15 within a total time of 12 days. For both of the HuRef and NA12878 datasets, the optimal costs in the general case are sometimes much smaller than those in the all-heterozygous case. This implies that some columns in the input matrix (after applying certain variant calling methods) still correspond to false-heterozygous sites.
机译:动机:单倍型在遗传分析中起着至关重要的作用,并具有许多应用,例如基因疾病诊断,关联研究,祖先推断等。 DNA测序技术的发展使得可以从源自单个个体的两个染色体拷贝的一组比对读取中获得单倍型。这种方法通常称为单元型组装。迫切需要能够为单倍型装配问题提供最佳解决方案的精确算法。不幸的是,以前针对该问题的算法要么无法输出最佳解决方案,要么即使在PC集群上执行也花费很长时间。结果:我们开发了一种在最小误差校正(MEC)模型下为单倍型装配问题寻找最佳解决方案的方法。大多数以前的方法都假定输入矩阵中的列对应于(假定的)杂合位点。这种全杂合的假设对大多数列都是正确的,但对于少数列则可能是不正确的。在本文中,我们考虑带有或不带有全杂合假设的MEC模型。在我们的方法中,我们首先使用新方法将输入读取矩阵分解为小的独立块,然后将每个块的问题建模为整数线性规划问题,然后由整数线性规划求解器进行求解。我们已经使用过滤后的HuRef和NA 12878数据集(在应用了一些变体调用方法之后)在单台PC [具有i7-3960X CPU的Linux(x64)台式机]上测试了程序。使用全杂合假设,我们的方法可以在31小时(第15个染色体中最困难的区块为26小时,其他区块仅为5小时)的总时间内最佳求解整个HuRef数据集。据我们所知,这是首次针对滤波后的HuRef数据集完全获得MEC最优解。此外,在一般情况下(没有全杂合假设),对于HuRef数据集,我们的方法可以在12天的总时间内最佳地求解除15号染色体中最困难的区域以外的所有染色体。对于HuRef和NA12878数据集,一般情况下的最佳成本有时要比全杂合情况下的最佳成本小得多。这意味着输入矩阵中的某些列(在应用某些变体调用方法之后)仍然对应于错误的杂合位点。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号