Journal: Machine Learning

Clustering with missing features: a penalized dissimilarity measure based approach

Abstract

Many real-world clustering problems are plagued by incomplete data characterized by missing or absent features for some or all of the data instances. Traditional clustering methods cannot be directly applied to such data without preprocessing by imputation or marginalization techniques. In this article, we overcome this drawback by utilizing a penalized dissimilarity measure which we refer to as the feature weighted penalty based dissimilarity (FWPD). Using the FWPD measure, we modify the traditional k-means clustering algorithm and the standard hierarchical agglomerative clustering algorithms so as to make them directly applicable to datasets with missing features. We present time complexity analyses for these new techniques and also undertake a detailed theoretical analysis showing that the new FWPD based k-means algorithm converges to a local optimum within a finite number of iterations. We also present a detailed method for simulating random as well as feature dependent missingness. We report extensive experiments on various benchmark datasets for different types of missingness showing that the proposed clustering techniques have generally better results compared to some of the most well-known imputation methods which are commonly used to handle such incomplete data. We append a possible extension of the proposed dissimilarity measure to the case of absent features (where the unobserved features are known to be undefined).
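The abstract does not state the FWPD formula, so the following is only a minimal sketch of what a penalized dissimilarity of this kind could look like. It assumes the form delta(i, j) = (1 - alpha) * d(i, j) / d_max + alpha * p(i, j), where d is the Euclidean distance over the features observed in both instances, d_max is the largest such distance in the dataset, and p is a penalty for features missing from either instance, weighted by how often each feature is observed. The function name `penalized_dissimilarity`, the parameter `alpha`, and the use of NaN to mark missing values are illustrative assumptions, not taken from the source.

```python
import numpy as np


def penalized_dissimilarity(X, alpha=0.25):
    """Pairwise penalized dissimilarity for data with NaN-marked missing values.

    Assumed form (not given in the abstract):
        delta(i, j) = (1 - alpha) * d(i, j) / d_max + alpha * p(i, j)
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)  # (n, m) mask: True where a value is present

    # Assumed feature weight: number of instances in which the feature is observed.
    weights = observed.sum(axis=0).astype(float)
    total_weight = weights.sum()
    if total_weight == 0:
        raise ValueError("every value is missing; the penalty is undefined")

    n = X.shape[0]
    dist = np.zeros((n, n))
    penalty = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            common = observed[i] & observed[j]
            diff = X[i, common] - X[j, common]
            # Distance restricted to features observed in both instances.
            dist[i, j] = np.sqrt(np.sum(diff ** 2))
            # Penalty: weight share of features missing from at least one instance.
            penalty[i, j] = weights[~common].sum() / total_weight

    d_max = dist.max()
    scaled = dist / d_max if d_max > 0 else dist
    return (1 - alpha) * scaled + alpha * penalty


# Toy data: the third instance is missing its first feature.
X = [[0.0, 0.0], [3.0, 4.0], [np.nan, 0.0]]
D = penalized_dissimilarity(X, alpha=0.5)
```

Under this assumed form, an instance with missing features has a nonzero dissimilarity to itself, because the penalty term applies even when the observed values coincide. A matrix such as `D` could be passed to any clustering routine that accepts precomputed pairwise dissimilarities.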
