Journal: SIGMOD Record

Estimating the Selectivity of tf-idf Based Cosine Similarity Predicates


Abstract

An increasing number of database applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly being used in complex queries. An immediate challenge faced by current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, there are no known methods for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf based cosine similarity predicates. We evaluate our approach on three different real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.
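To make the quantity being estimated concrete, the sketch below computes the exact selectivity of a tf-idf cosine similarity predicate by brute force: the fraction of strings in a collection whose cosine similarity to a query string meets a threshold. This is not the estimation method of the paper, which the abstract does not describe; it only shows the ground-truth value an estimator would be compared against. The whitespace tokenizer, the `log(1 + n/df)` idf variant, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(s):
    # Assumption: lowercase whitespace tokens; real systems may use q-grams.
    return s.lower().split()


def build_idf(strings):
    # Document frequency counts each token once per string.
    n = len(strings)
    df = Counter()
    for s in strings:
        df.update(set(tokenize(s)))
    # Assumption: one common idf variant; other formulas are equally valid.
    return {t: math.log(1 + n / d) for t, d in df.items()}


def tfidf_vector(s, idf):
    # Sparse, L2-normalized tf-idf vector; tokens unseen in the collection are dropped.
    tf = Counter(tokenize(s))
    vec = {t: c * idf[t] for t, c in tf.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return {}
    return {t: w / norm for t, w in vec.items()}


def cosine(u, v):
    # Dot product of two unit-length sparse vectors.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def exact_selectivity(query, strings, threshold):
    # Fraction of strings s with cosine(query, s) >= threshold.
    idf = build_idf(strings)
    q = tfidf_vector(query, idf)
    matches = sum(
        1 for s in strings if cosine(q, tfidf_vector(s, idf)) >= threshold
    )
    return matches / len(strings)


if __name__ == "__main__":
    names = ["acme corp", "acme corporation", "globex inc", "initech llc"]
    print(exact_selectivity("acme corp", names, 0.9))
```

Computing this value requires scanning every string, which is exactly the cost an optimizer cannot afford at planning time; an estimator has to approximate it from precomputed statistics instead.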
