Journal: Ecology and Evolution

Resemblance profiles as clustering decision criteria: Estimating statistical power, error, and correspondence for a hypothesis test for multivariate structure


Abstract

Clustering data continues to be a highly active area of data analysis, and resemblance profiles are being incorporated into ecological methodologies as a hypothesis testing-based approach to clustering multivariate data. However, these new clustering techniques have not been rigorously tested to determine the performance variability based on the algorithm's assumptions or any underlying data structures. Here, we use simulation studies to estimate the statistical error rates for the hypothesis test for multivariate structure based on dissimilarity profiles (DISPROF). We concurrently tested a widely used algorithm that employs the unweighted pair group method with arithmetic mean (UPGMA) to estimate the proficiency of clustering with DISPROF as a decision criterion. We simulated unstructured multivariate data from different probability distributions with increasing numbers of objects and descriptors, and grouped data with increasing overlap, overdispersion for ecological data, and correlation among descriptors within groups. Using simulated data, we measured the resolution and correspondence of clustering solutions achieved by DISPROF with UPGMA against the reference grouping partitions used to simulate the structured test datasets. Our results highlight the dynamic interactions between dataset dimensionality, group overlap, and the properties of the descriptors within a group (i.e., overdispersion or correlation structure) that are relevant to resemblance profiles as a clustering criterion for multivariate data. These methods are particularly useful for multivariate ecological datasets that benefit from distance-based statistical analyses. We propose guidelines for using DISPROF as a clustering decision tool that will help future users avoid potential pitfalls during the application of methods and the interpretation of results.
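
The abstract does not spell out how a profile-based test for multivariate structure works, so the following is only a minimal sketch of the general idea, written under stated assumptions. It assumes the usual similarity-profile permutation scheme: sort the pairwise dissimilarities into a profile, build a null expectation by shuffling each descriptor independently across objects, and compare the observed profile to that expectation. It is not the authors' DISPROF implementation. The function names (`dissimilarity_profile`, `permute_columns`, `profile_test`), the Euclidean metric, and the permutation counts are all hypothetical choices made for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist


def dissimilarity_profile(X, metric="euclidean"):
    """Return the sorted vector of pairwise dissimilarities among rows of X."""
    return np.sort(pdist(X, metric=metric))


def permute_columns(X, rng):
    """Shuffle each descriptor (column) independently across objects (rows).

    This destroys any multivariate structure while keeping each
    descriptor's marginal distribution unchanged.
    """
    return np.column_stack(
        [rng.permutation(X[:, j]) for j in range(X.shape[1])]
    )


def profile_test(X, n_perm=199, metric="euclidean", seed=0):
    """Permutation test of 'no multivariate structure' (illustrative only).

    Returns the test statistic (summed absolute deviation of the observed
    profile from the mean null profile) and a permutation p-value.
    """
    rng = np.random.default_rng(seed)
    observed = dissimilarity_profile(X, metric)

    # First set of permutations: estimate the mean profile expected
    # when there is no structure among the objects.
    null_profiles = np.array(
        [dissimilarity_profile(permute_columns(X, rng), metric)
         for _ in range(n_perm)]
    )
    mean_profile = null_profiles.mean(axis=0)
    pi_obs = np.abs(observed - mean_profile).sum()

    # Second, independent set of permutations: null distribution of the
    # statistic itself.
    pi_null = np.array(
        [np.abs(dissimilarity_profile(permute_columns(X, rng), metric)
                - mean_profile).sum()
         for _ in range(n_perm)]
    )
    p_value = (1 + np.sum(pi_null >= pi_obs)) / (n_perm + 1)
    return pi_obs, p_value


if __name__ == "__main__":
    # Two clearly separated groups of 10 objects with 5 descriptors each
    # (simulated here purely for demonstration).
    rng = np.random.default_rng(42)
    grouped = np.vstack([rng.normal(0, 1, (10, 5)),
                         rng.normal(10, 1, (10, 5))])
    print(profile_test(grouped))
```

In a clustering workflow of the kind the abstract describes, a test like this would be applied at each node of a UPGMA dendrogram (average linkage), and a node would be split further only if the test rejects the hypothesis of no structure among its members. Repeating the test over many simulated unstructured datasets and counting rejections is one way to estimate a type I error rate.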
