Venue: International Conference on Similarity Search and Applications

A Multivariate Correlation Distance for Vector Spaces

Abstract

We investigate a distance metric, previously defined for the measurement of structured data, in the more general context of vector spaces. The metric has a basis in information theory and assesses the distance between two vectors in terms of their relative information content. The resulting metric gives an outcome based on the dimensional correlation, rather than magnitude, of the input vectors, in a manner similar to Cosine Distance. In this paper the metric is defined, and assessed, in comparison with Cosine Distance, for its major properties: semantics, properties for use within similarity search, and evaluation efficiency. We find that it is fairly well correlated with Cosine Distance in dense spaces, but its semantics are in some cases preferable. In a sparse space, it significantly outperforms Cosine Distance over TREC data and queries, the only large collection for which we have a human-ratified ground truth. This result is backed up by another experiment over MovieLens data. In dense Cartesian spaces it has better properties for use with similarity indices than either Cosine or Euclidean Distance. In its definitional form it is very expensive to evaluate for high-dimensional sparse vectors; to counter this, we show an algebraic rewrite which allows its evaluation to be performed more efficiently. Overall, when a multivariate correlation metric is required over positive vectors, the metric studied here (SED) seems to be a better choice than Cosine Distance in many circumstances.
