Seeing Distinct Groups Where There are None: Spurious Patterns from Between-Group PCA

Cardini, Andrea; O'Higgins, Paul; Rohlf, F. James

Abstract

Using sampling experiments, we found that, when there are fewer groups than variables, between-groups PCA (bgPCA) may suggest surprisingly distinct differences among groups for data in which none exist. Although this problem has apparently not been noticed before, its cause is easy to understand. When the number of variables, $p$, is greater than $g - 1$, a bgPCA captures all $g - 1$ dimensions of variation among the $g$ group means, but only a fraction of the $\sum_i n_i - g$ dimensions of within-group variation (where the $n_i$ are the group sample sizes). This distorts the appearance of bgPCA plots, because the within-group variation is underrepresented unless the variables are sufficiently correlated that the total variation can be accounted for with just $g - 1$ dimensions. The effect is most obvious when sample sizes are small relative to the number of variables, because smaller samples spread out less, but the distortion is present even for large samples. Strong covariance among variables substantially reduces the magnitude of the problem, because it effectively reduces the dimensionality of the data and thus allows a larger proportion of the within-group variation to be accounted for within the $(g - 1)$-dimensional space of a bgPCA. The distortion remains relevant, although its strength varies from case to case depending on the structure of the data ($p$, $g$, covariances, etc.). These are important problems for a method mainly designed for the analysis of variation among groups when there are very large numbers of variables and relatively small samples. In such cases, users are likely to conclude that the groups they are comparing are much more distinct than they really are. Having many variables but only small samples is a common problem in fields ranging from morphometrics (as in our examples) to molecular analyses.
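The effect described above can be illustrated with a small simulation. The following is a minimal sketch and not the authors' sampling experiments: it assumes three groups of 20 specimens drawn from the same isotropic normal distribution (so no true group differences exist), and the helper names `bgpca_scores` and `separation_ratio` are hypothetical. It projects the data onto the principal axes of the group means and reports the ratio of between-group to within-group sum of squares in that projection for an increasing number of variables.

```python
import numpy as np

def bgpca_scores(X, labels):
    """Project grand-mean-centered data onto the principal axes of the group means."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    grand_mean = X.mean(axis=0)
    # g x p matrix of group means, centered on the grand mean
    means = np.vstack([X[labels == k].mean(axis=0) for k in groups]) - grand_mean
    # Right singular vectors of the centered means span the between-group space
    _, _, Vt = np.linalg.svd(means, full_matrices=False)
    axes = Vt[: len(groups) - 1]  # keep g - 1 axes
    return (X - grand_mean) @ axes.T

def separation_ratio(scores, labels):
    """Between-group over within-group sum of squares in the score space."""
    labels = np.asarray(labels)
    grand = scores.mean(axis=0)
    between, within = 0.0, 0.0
    for k in np.unique(labels):
        S = scores[labels == k]
        m = S.mean(axis=0)
        between += len(S) * np.sum((m - grand) ** 2)
        within += np.sum((S - m) ** 2)
    return between / within

rng = np.random.default_rng(0)
g, n_per_group = 3, 20            # assumed design, chosen for illustration only
labels = np.repeat(np.arange(g), n_per_group)

results = {}
for p in (2, 20, 200, 2000):
    # Pure noise: every specimen comes from the same distribution
    X = rng.standard_normal((g * n_per_group, p))
    results[p] = separation_ratio(bgpca_scores(X, labels), labels)
    print(f"p={p:5d}  between/within SS ratio = {results[p]:.3f}")
```

Under these assumptions, the ratio should grow as $p$ increases even though the groups are identical by construction, which is the kind of apparent separation the abstract warns about. Adding strong covariance among the simulated variables would be expected to weaken the effect.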

