Journal: Expert Systems with Applications

Data clustering using proximity matrices with missing values

Abstract

In most applications of data clustering, the input data includes vectors describing the location of each data point, from which distances between data points can be calculated and a proximity matrix constructed. In some applications, however, the only available input is the proximity matrix, that is, the distances between each pair of data points. Several clustering algorithms can still be applied, but if the proximity matrix has missing values, no standard method is directly applicable. Imputation can be used to replace missing values, but most imputation methods do not apply when only the proximity matrix is available. As a partial solution to fill this gap, we propose the Proximity Matrix Completion (PMC) algorithm. This algorithm assumes that data is missing for one of two reasons, complete dissimilarity or incomplete observations, and imputes values accordingly. To determine which case applies, the data is modeled as a graph and a set of maximum cliques in the graph is found. Overlap between cliques then determines the case, and hence the method of imputation, for each missing data point. This approach is motivated by an application in plant breeding, where new experimental seed varieties must be clustered into sets of varieties that interact similarly with the environment; this application is presented as a case study in the paper. The applicability, limitations and performance of the new algorithm versus other methods of imputation are further studied by applying it to datasets derived from three well-known test datasets. © 2019 Elsevier Ltd. All rights reserved.
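
The abstract does not spell out the imputation rules, so the following is only a rough illustrative sketch of the general idea (observed distances as graph edges, clique enumeration, clique overlap deciding how a missing entry is filled), not the authors' PMC algorithm. Everything in it is an assumption made for illustration: the function names `maximal_cliques` and `impute_missing`, the use of a plain Bron-Kerbosch enumeration of maximal cliques, the choice to fill "incomplete observation" entries with the smallest two-step path through a shared point, and the choice to fill "complete dissimilarity" entries with the largest observed distance.

```python
from itertools import combinations


def maximal_cliques(adj):
    """Enumerate all maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def impute_missing(dist):
    """Fill None entries of a symmetric distance matrix (hypothetical rules)."""
    n = len(dist)
    # Two points are linked when their distance was actually observed.
    adj = {i: {j for j in range(n) if j != i and dist[i][j] is not None}
           for i in range(n)}
    cliques = maximal_cliques(adj)
    # Assumed stand-in for "complete dissimilarity": the largest observed distance.
    far = max(d for row in dist for d in row if d is not None)
    filled = [row[:] for row in dist]
    for i, j in combinations(range(n), 2):
        if dist[i][j] is not None:
            continue
        # Points shared by some clique containing i and some clique containing j.
        shared = set()
        for a in cliques:
            if i in a:
                for b in cliques:
                    if j in b:
                        shared |= a & b
        if shared:
            # Overlap: assume an incomplete observation and use the shortest
            # two-step path through a shared point (a triangle-inequality bound).
            value = min(dist[i][k] + dist[k][j] for k in shared)
        else:
            # No overlap: assume the two points are completely dissimilar.
            value = far
        filled[i][j] = filled[j][i] = value
    return filled


N = None  # marker for a missing distance
dist = [
    [0, 1, 2, N, N],
    [1, 0, N, N, N],
    [2, N, 0, N, N],
    [N, N, N, 0, 5],
    [N, N, N, 5, 0],
]
filled = impute_missing(dist)
```

In this toy matrix, points 1 and 2 both have an observed distance to point 0, so their missing entry is treated as an incomplete observation, while every pair that spans the groups {0, 1, 2} and {3, 4} has no clique overlap and receives the fallback value. The completed matrix could then be passed to any clustering method that accepts precomputed distances.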
