Journal: IEEE Transactions on Knowledge and Data Engineering

Information-Theoretic Distance Measures for Clustering Validation: Generalization and Normalization

Abstract

This paper studies the generalization and normalization of information-theoretic distance measures for clustering validation. We first introduce a uniform representation of these distance measures, called the quasi-distance, which is induced from a general form of conditional entropy. The quasi-distance has three properties: symmetry, the triangle law, and the minimum reachable. These properties make the quasi-distance a natural external measure for clustering validation. We also observe that the range of a distance measure differs from one data set to another when it is applied to clustering validation. Therefore, when comparing the performance of clustering algorithms across different data sets, the distance measures must be normalized so that their ranges are equal. A critical challenge for distance normalization is obtaining the range of a distance measure for a given data set. To that end, we theoretically analyze the computation of the maximum value of a distance measure for a data set. Finally, we compare the performance of the partitional clustering algorithm K-means on various real-world data sets. The experiments show that the normalized distance measures perform better than the original distance measures when comparing clusterings of different data sets, and that the normalized Shannon distance performs best among the four distance measures under study.
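To make the idea of an entropy-based distance between two clusterings concrete, here is a minimal sketch in Python. It is not the paper's quasi-distance, whose exact definition the abstract does not give. It assumes the well-known Shannon-entropy case, the variation of information VI(A, B) = H(A|B) + H(B|A) = 2H(A, B) - H(A) - H(B), and it normalizes by the joint entropy H(A, B) rather than by the data-set-specific maximum that the paper analyzes. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts, n):
    """Shannon entropy (natural log) of a partition given its cluster sizes."""
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def variation_of_information(labels_a, labels_b):
    """Return (VI, joint entropy) for two labelings of the same objects.

    VI = H(A|B) + H(B|A) = 2*H(A,B) - H(A) - H(B).
    Assumption: this Shannon-based distance stands in for the paper's
    more general quasi-distance, which is not specified in the abstract.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("labelings must be non-empty and of equal length")
    n = len(labels_a)
    h_a = entropy(Counter(labels_a).values(), n)
    h_b = entropy(Counter(labels_b).values(), n)
    # Joint entropy from the contingency counts of (label_a, label_b) pairs.
    h_ab = entropy(Counter(zip(labels_a, labels_b)).values(), n)
    vi = max(2.0 * h_ab - h_a - h_b, 0.0)  # clamp tiny negative rounding error
    return vi, h_ab


def normalized_vi(labels_a, labels_b):
    """VI divided by the joint entropy, giving a value in [0, 1].

    Assumption: joint-entropy normalization is one common choice; the paper
    instead derives the maximum of each measure for a given data set.
    """
    vi, h_ab = variation_of_information(labels_a, labels_b)
    return 0.0 if h_ab == 0.0 else vi / h_ab


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(normalized_vi(truth, [1, 1, 0, 0]))  # same partition, relabeled
    print(normalized_vi(truth, [0, 1, 0, 1]))  # statistically independent partition
```

In this sketch the distance depends only on how objects are grouped, not on the label names, so a relabeled copy of the same partition has distance zero. The unnormalized value grows with the entropies of the partitions involved, which illustrates why raw values from different data sets are hard to compare directly.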
