Conference: International Conference on Similarity Search and Applications

A Multivariate Correlation Distance for Vector Spaces


Abstract

We investigate a distance metric, previously defined for the measurement of structured data, in the more general context of vector spaces. The metric has a basis in information theory and assesses the distance between two vectors in terms of their relative information content. The resulting metric gives an outcome based on the dimensional correlation, rather than the magnitude, of the input vectors, in a manner similar to Cosine Distance. In this paper the metric is defined and then assessed, in comparison with Cosine Distance, on three major properties: semantics, suitability for use within similarity search, and evaluation efficiency. We find that it is fairly well correlated with Cosine Distance in dense spaces, but its semantics are in some cases preferable. In a sparse space, it significantly outperforms Cosine Distance over TREC data and queries, the only large collection for which we have a human-ratified ground truth. This result is backed up by another experiment over MovieLens data. In dense Cartesian spaces it has better properties for use with similarity indices than either Cosine or Euclidean Distance. In its definitional form it is very expensive to evaluate for high-dimensional sparse vectors; to counter this, we show an algebraic rewrite which allows it to be evaluated more efficiently. Overall, when a multivariate correlation metric is required over positive vectors, this metric (abbreviated SED) seems to be a better choice than Cosine Distance in many circumstances.
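The abstract's contrast between correlation-based and magnitude-based distances can be illustrated with the baseline it names. The sketch below is not taken from the paper: it assumes the common definition of Cosine Distance as one minus cosine similarity (the paper may use a different variant), and the function names and sample vectors are hypothetical. It does not implement SED, whose definition is not given in this abstract.

```python
import math

def cosine_distance(v, w):
    """Assumed definition: 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return 1.0 - dot / (norm_v * norm_w)

def euclidean_distance(v, w):
    """Standard Euclidean (L2) distance, which is sensitive to magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

# Hypothetical positive vectors chosen only for illustration.
v = [1.0, 2.0, 3.0]
scaled = [10.0 * x for x in v]   # same direction as v, larger magnitude
other = [3.0, 2.0, 1.0]          # same magnitude as v, different direction

# Scaling leaves Cosine Distance (numerically) at zero but moves the
# Euclidean Distance far from zero.
d_cos_scaled = cosine_distance(v, scaled)
d_euc_scaled = euclidean_distance(v, scaled)
d_cos_other = cosine_distance(v, other)
```

Under these assumptions, `d_cos_scaled` is zero up to floating-point error while `d_euc_scaled` is large, which is the sense in which a correlation-style distance ignores magnitude; `d_cos_other` is strictly positive because the two vectors weight their dimensions differently.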
