【24h】

Perfect phylogeny and haplotype assignment

机译:完善的系统发育和单倍型分配

获取原文

摘要

This paper is concerned with the reconstruction of perfect phylogenies from binary character data with missing values, and related problems of inferring complete haplotypes from haplotypes or genotypes with missing data. In cases where the problems considered are NP-hard we assume a rich data hypothesis under which they become tractable. Natural probabilistic models are introduced for the generation of character vectors, haplotypes or genotypes with missing data, and it is shown that these models support the rich data hypothesis. The principal results include:
  • A near-linear time algorithm for inferring a perfect phylogeny from binary character data (or haplotype data) with missing values, under the rich data hypothesis;
  • A quadratic-time algorithm for inferring a perfect phylogeny from genotype data with missing values with high probability, under certain distributional assumptions;
  • Demonstration that the problems of maximum-likelihood inference of complete haplotypes from partial haplotypes or partial genotypes can be cast as minimum-entropy disjoint set cover problems;
  • In the case where the haplotypes come from a perfect phylogeny, a representation of the set cover problem as minimum-entropy covering of subtrees of a tree by nodes;
  • An exact algorithm for minimum-entropy subtree covering, and demonstration that it runs in polynomial time when the subtrees have small diameter;
  • Demonstration that a simple greedy approximation algorithm solves the minimum-entropy subtree covering problem with relative error tending to zero when the number of partial haplotypes per complete haplotype is large;
  • An asymptotically consistent method of estimating the frequencies of the complete haplotypes in a perfect phylogeny, under an iid model for the distribution of missing data;
  • Computational results on real data demonstrating the effectiveness of a the greedy algorithm for inferring haplotypes from genotypes with missing data, even inthe absence of a perfect phylogeny.
.
机译:本文涉及从具有缺失值的二元字符数据重建完美的系统发育,以及与从具有缺失数据的单倍型或基因型中推断出完整单倍型有关的问题。如果考虑的问题是 NP 困难的情况,我们假设丰富的数据假设使它们变得易于处理。引入自然概率模型来生成缺少数据的字符向量,单倍型或基因型,并证明这些模型支持丰富的数据假设。主要结果包括:
  • 一种近线性时间算法,可根据丰富的数据假设从具有缺失值的二进制字符数据(或单倍型数据)推断出完美的系统发育;
  • 一种二次时间算法,可以在某些分布假设下,从具有缺失值的基因型数据中推断出完美的系统发育;
  • 论证了从部分单倍型或部分基因型推断出完整单倍型的最大似然性问题可以解释为最小熵不相交集覆盖问题;
  • 在单倍型来自完美的系统发育的情况下,将集覆盖问题表示为节点对树的子树的最小熵覆盖;
  • 一种最小熵子树覆盖的精确算法,并证明了当子树直径较小时,它可以在多项式时间内运行;
  • 证明了一种简单的贪婪近似算法可以解决最小熵子树覆盖问题,当每个完整单元型的部分单元型的数量较大时,相对误差趋于零;
  • 在缺失数据分布的iid模型下,一种渐进一致的方法,用于估计理想系统发育中完整单倍型的频率;
  • 真实数据的计算结果证明了贪心算法从缺少数据的基因型中推断单倍型的有效性,即使在没有完善的系统发育的情况下也是如此。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号