Conference: Machine Learning and Data Mining in Pattern Recognition

Is the Distance Compression Effect Overstated? Some Theory and Experimentation



Abstract

Previous work in the document clustering literature has shown that the Minkowski-ρ distance metrics are unsuitable for clustering very high dimensional document data. This unsuitability is put down to the effect of "compression" of the distances created using the Minkowski-ρ metrics on high dimensional data. Previous experimental work on distance compression has generally used the performance of clustering algorithms on distances created by the different distance metrics as a proxy for the quality of the distance representations created by those metrics. In order to separate out the effects of distances from the performance of the clustering algorithms we tested the homogeneity of the latent classes with respect to item neighborhoods rather than testing the homogeneity of clustering solutions with respect to latent classes. We show the theoretical relationships between the cosine, correlation, and Euclidean metrics. We posit that some of the performance differential between the cosine and correlation metrics and the Minkowski-ρ metrics is due to the inbuilt normalization of the cosine and correlation metrics. The normalization effect decreases with increasing dimensionality and the distance compression effect increases with increasing dimensionality. For document datasets with dimensionality up to 20,000, the normalization effect dominates the distance compression effect. We propose a methodology for measuring the relative normalization and distance compression effects.
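The abstract does not reproduce the paper's derivation, so the following is only a minimal sketch of two standard identities that relate the three measures it names; it should not be read as the authors' own formulation. For vectors scaled to unit length, the squared Euclidean distance equals 2(1 - cosine similarity), and the Pearson correlation is the cosine similarity of the mean-centered vectors. The function names and the random 1000-dimensional test vectors are assumptions made for illustration.

```python
import numpy as np


def cosine_similarity(x, y):
    """Cosine of the angle between x and y."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


def pearson_correlation(x, y):
    """Correlation computed as the cosine of the mean-centered vectors."""
    return cosine_similarity(x - x.mean(), y - y.mean())


def normalized_sq_euclidean(x, y):
    """Squared Euclidean distance after scaling both vectors to unit length."""
    xu = x / np.linalg.norm(x)
    yu = y / np.linalg.norm(y)
    return float(np.sum((xu - yu) ** 2))


# Hypothetical data: two random non-negative vectors standing in for
# high-dimensional term-frequency rows.
rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.random(1000)

cos_xy = cosine_similarity(x, y)

# Identity 1: ||x/|x| - y/|y|||^2 == 2 * (1 - cos(x, y))
identity_gap = abs(normalized_sq_euclidean(x, y) - 2.0 * (1.0 - cos_xy))

# Identity 2: centered cosine == Pearson correlation (checked against NumPy)
corr_gap = abs(pearson_correlation(x, y) - np.corrcoef(x, y)[0, 1])

print(f"identity 1 gap: {identity_gap:.2e}")
print(f"identity 2 gap: {corr_gap:.2e}")
```

Both gaps should sit at floating-point rounding level. This illustrates the "inbuilt normalization" the abstract refers to: cosine and correlation compare vectors after a length (and, for correlation, mean) adjustment that the plain Euclidean metric does not apply.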
