Pathologies of Between-Groups Principal Components Analysis in Geometric Morphometrics

Bookstein Fred L.

Abstract

Good empirical applications of geometric morphometrics (GMM) typically involve several times more variables than specimens, a situation the statistician refers to as "high p/n," where p is the count of variables and n the count of specimens. This note calls your attention to two predictable catastrophic failures of one particular multivariate statistical technique, between-groups principal components analysis (bgPCA), in this high-p/n setting. The more obvious pathology is this: when applied to the patternless (null) model of p identically distributed Gaussians over groups of the same size, both bgPCA and its algebraic equivalent, partial least squares (PLS) analysis against group, necessarily generate the appearance of huge equilateral group separations that are fictitious (absent from the statistical model). When specimen counts by group vary greatly or when any group includes fewer than about ten specimens, an even worse failure of the technique obtains: the smaller the group, the more likely a bgPCA is to fictitiously identify that group as the end-member of one of its derived axes. For these two reasons, when used in GMM and other high-p/n settings the bgPCA method very often leads to invalid or insecure biological inferences. This paper demonstrates and quantifies these and other pathological outcomes both for patternless models and for models with one or two valid factors, then offers suggestions for how GMM practitioners should protect themselves against the consequences for inference of these lamentably predictable misrepresentations. The bgPCA method should never be used unskeptically (it is always untrustworthy, never authoritative), and whenever it appears in partial support of any biological inference it must be accompanied by a wide range of diagnostic plots and other challenges, many of which are presented here for the first time.
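
The first pathology described in the abstract can be reproduced in a few lines. What follows is a minimal simulation sketch, not the paper's own code: it assumes NumPy, draws specimens from the null model of p independent standard Gaussians with equal group sizes, and uses one common form of bgPCA (principal axes of the unweighted, centered group means, with all specimens projected onto them). The function names `bgpca_scores`, `separation_ratio`, and `null_separation`, the separation measure itself, and the parameter values are all illustrative assumptions.

```python
import numpy as np


def bgpca_scores(X, labels):
    """Project every specimen onto the principal axes of the group means."""
    Xc = X - X.mean(axis=0)                     # center at the grand mean
    group_ids = np.unique(labels)
    means = np.vstack([Xc[labels == k].mean(axis=0) for k in group_ids])
    means = means - means.mean(axis=0)          # unweighted centering (assumption)
    # Rows of Vt are the principal axes of the centered group means.
    _, _, Vt = np.linalg.svd(means, full_matrices=False)
    axes = Vt[: len(group_ids) - 1]             # at most g - 1 nontrivial axes
    return Xc @ axes.T


def separation_ratio(scores, labels):
    """Smallest centroid-to-centroid gap over largest within-group RMS radius.

    This is an ad hoc illustrative measure, not one defined in the paper.
    """
    group_ids = np.unique(labels)
    centroids = np.vstack([scores[labels == k].mean(axis=0) for k in group_ids])
    spreads = [
        np.sqrt(((scores[labels == k] - c) ** 2).sum(axis=1).mean())
        for k, c in zip(group_ids, centroids)
    ]
    gaps = [
        np.linalg.norm(centroids[i] - centroids[j])
        for i in range(len(group_ids))
        for j in range(i + 1, len(group_ids))
    ]
    return min(gaps) / max(spreads)


def null_separation(p, n_per_group=10, n_groups=3, seed=0):
    """Separation ratio for patternless Gaussian data: no true group effect."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_per_group * n_groups, p))
    labels = np.repeat(np.arange(n_groups), n_per_group)
    return separation_ratio(bgpca_scores(X, labels), labels)


if __name__ == "__main__":
    for p in (5, 300, 3000):
        print(f"p = {p:5d}: separation ratio = {null_separation(p):.2f}")
```

Under these assumptions the ratio should grow with p even though the simulated data contain no group structure at all, which is the fictitious separation the abstract warns about. Plotting the two score columns for a large p would be expected to show three tight clusters at the corners of a roughly equilateral triangle.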
