DEXA 2010; International Conference on Database and Expert Systems Applications

An Efficient Similarity Join Algorithm with Cosine Similarity Predicate


Abstract

Given a large collection of objects, finding all pairs of similar objects, known as similarity join, is widely used to solve problems in many application domains. The computation time of similarity join is a critical issue, since a naive similarity join computes similarity values for all possible pairs of objects. Several existing algorithms adopt prefix filtering to avoid unnecessary similarity computations; however, these algorithms are inefficient at filtering out object pairs, in particular when an aggregate weighted similarity function, such as cosine similarity, is used to quantify the similarity between objects. This is mostly caused by the large prefixes the algorithms select. In this paper, we propose an alternative method that selects small prefixes by exploiting the relationship between the arithmetic mean and the geometric mean of elements' weights. A new algorithm, MMJoin, which implements the proposed method, dramatically reduces the average prefix size without much overhead and, as a result, saves considerable computation time. We demonstrate through empirical evaluation on large-scale real-world datasets that our algorithm outperforms a state-of-the-art one.
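As a rough illustration of the prefix-filtering idea the abstract refers to, the sketch below implements a generic prefix-filtered cosine similarity join in Python. It is not MMJoin: the arithmetic-mean/geometric-mean prefix selection is not described in this abstract, so the sketch uses a plain Cauchy-Schwarz suffix bound instead. The function names, the `EPS` tolerance, the rarest-feature-first ordering, and the dict-of-weights input format are all assumptions made for illustration.

```python
import math
from collections import defaultdict

EPS = 1e-9  # assumed tolerance so rounding never shortens a prefix


def normalize(vec):
    """Scale a sparse vector (feature -> weight) to unit Euclidean norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()}


def prefix_features(vec, order, threshold):
    """Leading features (in the global order) kept until the remaining
    suffix has norm below the threshold. By Cauchy-Schwarz, a suffix with
    norm < threshold cannot alone reach the threshold against a unit vector."""
    feats = sorted(vec, key=order.__getitem__)
    remaining = sum(w * w for w in vec.values())  # squared norm not yet covered
    chosen = []
    for f in feats:
        if remaining < threshold * threshold - EPS:
            break
        chosen.append(f)
        remaining -= vec[f] ** 2
    return chosen


def similarity_join(vectors, threshold):
    """Return {(id_a, id_b): cosine} for all pairs with cosine >= threshold.
    Assumes non-empty, non-zero vectors and 0 < threshold <= 1."""
    vecs = {k: normalize(v) for k, v in vectors.items()}

    # Global feature order: rarest first (a common heuristic, assumed here).
    df = defaultdict(int)
    for v in vecs.values():
        for f in v:
            df[f] += 1
    order = {f: i for i, f in enumerate(sorted(df, key=lambda f: (df[f], f)))}

    # Candidate generation: only pairs sharing a prefix feature survive.
    index = defaultdict(list)
    candidates = set()
    for k, v in vecs.items():
        for f in prefix_features(v, order, threshold):
            for other in index[f]:
                candidates.add((other, k))
            index[f].append(k)

    # Verification: exact cosine on the surviving candidates only.
    result = {}
    for a, b in candidates:
        sim = sum(w * vecs[b].get(f, 0.0) for f, w in vecs[a].items())
        if sim >= threshold:
            result[(a, b)] = sim
    return result


docs = {
    "a": {"apple": 1.0, "banana": 1.0},
    "b": {"apple": 1.0, "banana": 1.0, "cherry": 0.1},
    "c": {"durian": 1.0},
}
pairs = similarity_join(docs, 0.9)
```

In this sketch, shorter prefixes mean fewer inverted-index entries and fewer candidate pairs to verify, which is the cost the abstract says MMJoin reduces by selecting smaller prefixes.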
