Blind method for discovering number of clusters in multidimensional datasets by regression on linkage hierarchies generated from random data

Abstract

Determining the intrinsic number of clusters in a multidimensional dataset is a commonly encountered problem in exploratory data analysis. Unsupervised clustering algorithms often require the number of clusters as an input parameter, yet this number is typically not known a priori. Many methods have been proposed to estimate it, including statistical and information-theoretic approaches such as the gap statistic, but these methods are not always reliable when applied to non-normally distributed datasets containing outliers or noise. In this study, I propose a novel method called hierarchical linkage regression, which uses regression to estimate the intrinsic number of clusters in a multidimensional dataset. The method operates on the hypothesis that the organization of data into clusters can be inferred from the hierarchy generated by partitioning the dataset. It therefore does not depend directly on the specific values of the data or on their distribution, but on their relative ranking within the partitioned set. Moreover, the technique does not require empirical data to train on; it can instead use synthetic data generated from random distributions to fit the regression coefficients. The trained hierarchical linkage regression model is able to infer the number of clusters in test datasets of varying complexity and differing distributions, for image, text and numeric data, using the same regression model without retraining. The method performs favourably against other cluster number estimation techniques, and is also robust to parameter changes, as demonstrated by sensitivity analysis. The apparent robustness and generalizability of hierarchical linkage regression make it a promising tool for unsupervised exploratory data analysis and discovery.
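The general idea in the abstract (fit a regression on linkage hierarchies built from random data, then apply it unchanged to new data) can be illustrated with a minimal sketch. This is not the author's implementation: the blob generator, the choice of Ward linkage, the feature set (normalized top merge heights), the plain least-squares fit, and every function name below are assumptions made for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def make_blobs(rng, k, n_per=30, dim=2, spread=0.3, box=10.0):
    # Hypothetical generator: k Gaussian blobs with uniformly random centres.
    centres = rng.uniform(-box, box, size=(k, dim))
    return np.vstack([c + spread * rng.standard_normal((n_per, dim)) for c in centres])


def linkage_features(X, n_top=12):
    # Assumed feature set: the n_top largest Ward merge heights, divided by
    # the largest one so the features ignore the absolute scale of the data.
    Z = linkage(X, method="ward")
    heights = np.sort(Z[:, 2])[::-1][:n_top]
    return heights[1:] / heights[0]  # drop the leading constant 1.0


def fit_regression(rng, k_max=8, n_sets=200):
    # Train only on synthetic data whose cluster number is known by construction.
    ks = rng.integers(1, k_max + 1, size=n_sets)
    F = np.array([linkage_features(make_blobs(rng, k)) for k in ks])
    A = np.column_stack([np.ones(n_sets), F])  # intercept + features
    coef, *_ = np.linalg.lstsq(A, ks.astype(float), rcond=None)
    return coef, A, ks


def predict_k(coef, X):
    # Apply the fitted coefficients to the hierarchy of an unseen dataset.
    f = linkage_features(X)
    return float(coef[0] + f @ coef[1:])


rng = np.random.default_rng(0)
coef, A_train, k_train = fit_regression(rng)
X_test = make_blobs(rng, k=4)
k_hat = predict_k(coef, X_test)  # real-valued estimate; round to get a count
```

Because the features are ratios of merge heights, rescaling a dataset leaves them unchanged, which loosely mirrors the abstract's point that the estimate depends on the hierarchy rather than on the raw data values. A linear fit on these features is only a stand-in; the regression model and features used in the paper may differ.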

Bibliographic details

Author
Osbert C. Zalay
Year 2020
Original format PDF
Text language English

Similar literature

1. Blind method for discovering number of clusters in multidimensional datasets by regression on linkage hierarchies generated from random data [J]. Osbert C. Zalay. PLoS One. 2020, No. 1
2. BioFactHMM: Multidimensional Modeling of Biological Data from Hidden Markov Model Generated Datasets [J]. Manas Ranjan Pradhan, Beenu Mago, Deepak Kalra. Indian Journal of Computer Science and Engineering. 2020, No. 4
3. Brazilian Healthcare Record Linkage (BRHC-RLK) - A Record Linkage Methodology for Brazilian Medical Claims Datasets (DATASUS) [J]. Campos D. F., Rosim R. P., Duva A. S. Value in Health: the journal of the International Society for Pharmacoeconomics and Outcomes Research. 2017, No. 5
4. A Biclustering Method to Discover Co-regulated Genes Using Diverse Gene Expression Datasets [C]. Doruk Bozdag, Jeffrey D. Parvin, Umit V. Catalyurek. Bioinformatics and computational biology. 2009
5. Data mining and pattern discovery using exploratory and visualization methods for large multidimensional datasets [D]. Li, Hsin-Fang. 2013
6. Blind method for discovering number of clusters in multidimensional datasets by regression on linkage hierarchies generated from random data [O]. Osbert C. Zalay. 2020
7. A Biclustering Method to Discover Co-regulated Genes Using Diverse Gene Expression Datasets [O]. Jeffrey D. Parvin, Umit V. Catalyurek. 2012
