
On issues of singularity for confidence regions and hypothesis tests for topologies using generalized least squares.


Abstract

Recently, Susko [31] described a computationally inexpensive way to construct confidence regions (CRs) for topologies using a generalized least squares (GLS) test statistic with a chi-square distribution, which applies to maximum likelihood (ML) distances. Susko [32] also provided software implementations for nucleotide and protein data, called glsdna and glsprot respectively. Both the GLS test statistic and the sample average approximations used for the variances and covariances of the ML distances are accurate only asymptotically in the number of sites; in practice, however, usable sequences may be only hundreds of characters long. How GLS performs under these conditions has not been tested.

In this thesis, a simulation study is undertaken to gauge the consequences of these asymptotic limitations. To this end, 4- and 7-taxon trees were used to simulate nucleotide sequence data for each of the lengths 50, 100, 250, 500, 1000, 5000, and 10000. For each tree and each sequence length, on the order of 10000 CRs were generated, and the coverage probability of the true tree, the size of each CR, the estimated ML distances, and the estimated sample average variances and covariances were recorded. The coverage probabilities agreed with what is expected asymptotically for sequence lengths of 1000 and higher. For smaller sample sizes, the coverage probabilities were generally higher than the 0.95 value. It was anticipated that, for small sample sizes, the coverage probabilities would attain the expected 0.95 value if the true covariances were used to compute the GLS test statistic. Surprisingly, the coverage probabilities were drastically underestimated. The underlying cause can be attributed to a tendency for the ML distances to be overestimated for small sequence lengths, together with what we found to be an exponential increase in variance with the distance between taxa.

The second part of this thesis is directed toward fixing a serious limitation of the GLS software: computing the GLS test statistic requires the estimated covariance matrix of the ML distances to be invertible. If the matrix is singular, the test statistic cannot be computed and the programs crash. In molecular evolution models, the covariance matrix is a function of the substitution model and the underlying tree, but it is not generally known which types of trees and models give rise to singular covariance matrices. In this thesis, we show that singular covariance matrices arise if and only if some distance is exactly 0 or, equivalently, a pair of taxa have identical sequences with probability 1. In practice, however, the covariance matrix must be estimated, and the underlying causes of singularity are more complex. A necessary condition for singularity of the estimated covariance matrix is given, as well as two sufficient conditions: (1) the number of distinct nucleotide patterns at a site is less than the number of pairs of taxa, and (2) a special type of linear dependence is constructed in the rows of the estimated covariance matrix.

Finally, two alternatives to the glsdna and glsprot routines are introduced that allow a CR to be constructed even when the covariance matrix is singular. First, the routines glsdna_eig and glsprot_eig, as described in [32], use an eigenvalue cutoff approach. The causes of singularity described in this thesis led to an alternative approach that uses a distance cutoff; in other words, groups of closely related taxa are combined before a CR is computed. This approach is implemented as glsdna_dist and glsprot_dist. These approaches were compared via a simulation on two 8-taxon trees using nucleotide sequence data. Briefly, the results show that for small samples the glsdna_dist routine gives better coverage probabilities and far smaller CR sizes than those obtained with glsdna_eig, while for longer sequence lengths the routines exhibit similar results.
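
As a rough illustration of why a singular covariance matrix blocks the GLS quadratic form, and of what an eigenvalue cutoff can look like in general, here is a minimal NumPy sketch. It is not the glsdna_eig or glsprot_eig implementation: the function name `gls_statistic_eig_cutoff`, the cutoff value, and the small covariance matrix and residual vector are all hypothetical and chosen only for demonstration. The sketch evaluates the quadratic form on the eigen-directions whose eigenvalues exceed the cutoff and reports how many directions were kept.

```python
import numpy as np


def gls_statistic_eig_cutoff(residual, cov, cutoff=1e-8):
    """Quadratic form r' C^+ r using only eigen-directions of C above `cutoff`.

    Hypothetical helper for illustration; returns (statistic, directions kept).
    """
    residual = np.asarray(residual, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # eigh is for symmetric matrices and returns real eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    keep = eigvals > cutoff
    # Coordinates of the residual along the retained eigenvectors.
    proj = eigvecs[:, keep].T @ residual
    stat = float(np.sum(proj ** 2 / eigvals[keep]))
    return stat, int(keep.sum())


# Made-up 3x3 covariance matrix whose first two rows are identical,
# so it is singular and cannot be inverted directly.
cov = np.array([[2.0, 2.0, 0.5],
                [2.0, 2.0, 0.5],
                [0.5, 0.5, 1.0]])
# Made-up residual between estimated and fitted distances.
residual = np.array([0.1, 0.1, -0.2])

stat, kept = gls_statistic_eig_cutoff(residual, cov)
print(stat, kept)
```

In a sketch like this, the number of retained directions would presumably also matter for the reference chi-square distribution, which is one reason the function returns it alongside the statistic.
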

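Bibliographic record

  • Author

    Sheridan, Paul

  • Author affiliation

    Dalhousie University (Canada)

  • Degree-granting institution: Dalhousie University (Canada)
  • Subject: Mathematics
  • Degree: M.Sc.
  • Year: 2007
  • Pages: 94 p.
  • Total pages: 94
  • Original format: PDF
  • Language of text: eng
  • Chinese Library Classification: Mathematics

  • Date added to database: 2022-08-17 11:39:36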
