Blind method for discovering number of clusters in multidimensional datasets by regression on linkage hierarchies generated from random data

Abstract

Determining the intrinsic number of clusters in a multidimensional dataset is a commonly encountered problem in exploratory data analysis. Unsupervised clustering algorithms often require the number of clusters as an input parameter, yet this number is typically not known a priori. Many methods have been proposed to estimate it, including statistical and information-theoretic approaches such as the gap statistic, but these methods are not always reliable when applied to non-normally distributed datasets containing outliers or noise. In this study, I propose a novel method called hierarchical linkage regression, which uses regression to estimate the intrinsic number of clusters in a multidimensional dataset. The method operates on the hypothesis that the organization of data into clusters can be inferred from the hierarchy generated by partitioning the dataset. It therefore does not depend directly on the specific values of the data or on their distribution, but on their relative ranking within the partitioned set. Moreover, the technique does not require empirical data to train on; it can instead use synthetic data generated from random distributions to fit the regression coefficients. The trained hierarchical linkage regression model is able to infer the number of clusters in test datasets of varying complexity and differing distributions, for image, text and numeric data, using the same regression model without retraining. The method performs favourably against other cluster number estimation techniques, and is also robust to parameter changes, as demonstrated by sensitivity analysis. The apparent robustness and generalizability of hierarchical linkage regression make it a promising tool for unsupervised exploratory data analysis and discovery.
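The abstract gives no implementation details, so the following is only a minimal sketch of the general idea (regressing cluster count on features of a linkage hierarchy, trained purely on random synthetic data), not the author's method. Everything specific here is an assumption made for illustration: Ward linkage, features built from ratios of the top merge heights, an ordinary least-squares fit, Gaussian blobs as training data, and the hypothetical helper names `make_blobs`, `hierarchy_features`, `fit_linkage_regression` and `estimate_k`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def make_blobs(rng, k, n_per=30, dim=2, spread=0.05):
    """Synthetic dataset (assumption): k Gaussian blobs with random centres."""
    centres = rng.uniform(0.0, 1.0, size=(k, dim))
    return np.vstack(
        [c + spread * rng.standard_normal((n_per, dim)) for c in centres]
    )


def hierarchy_features(X, m=12):
    """Features taken from the linkage hierarchy only (assumed design):
    the top m merge heights, each divided by the largest one, so the
    absolute scale of the data drops out. Returns m - 1 ratios."""
    Z = linkage(X, method="ward")   # column 2 holds the merge heights
    heights = Z[-m:, 2][::-1]       # top-of-tree merges, largest first
    return heights[1:] / heights[0]


def fit_linkage_regression(rng, n_train=200, k_max=8):
    """Fit least-squares coefficients on random synthetic data with known k."""
    ks = rng.integers(1, k_max + 1, size=n_train)
    F = np.array([hierarchy_features(make_blobs(rng, k)) for k in ks])
    A = np.hstack([F, np.ones((n_train, 1))])   # append intercept column
    coef, *_ = np.linalg.lstsq(A, ks.astype(float), rcond=None)
    return coef


def estimate_k(X, coef):
    """Apply the fitted coefficients to the hierarchy features of a new dataset."""
    f = hierarchy_features(X)
    return float(np.append(f, 1.0) @ coef)


rng = np.random.default_rng(0)
coef = fit_linkage_regression(rng)
X_test = make_blobs(rng, k=4)
k_hat = estimate_k(X_test, coef)
print(f"estimated cluster count: {k_hat:.2f} (true value 4)")
```

The estimate is a real number and would need rounding to give a cluster count. A linear model on height ratios is the simplest choice that shows the train-on-random-data workflow; it makes no claim to reproduce the accuracy or the rank-based features described in the abstract.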

Bibliographic details

  • Author

    Osbert C. Zalay

  • Year: 2020
  • Original format: PDF
  • Text language: English
