Journal: IEEE Transactions on Knowledge and Data Engineering

Similarity Join and Similarity Self-Join Size Estimation in a Streaming Environment

Abstract

We study the problem of similarity self-join and similarity join size estimation in a streaming setting where the goal is to estimate, in one scan of the input and with space sublinear in the input size, the number of record pairs that have a similarity within a given threshold. The problem has many applications in data cleaning and query plan generation, where the cost of a similarity join may be estimated before actually doing the join. On unary input, where two records either match or don't match, the problem becomes join and self-join size estimation, for which one-pass algorithms are readily available. Our work addresses the problem for $d$-ary input, for $d \geq 1$, where the degree of similarity can vary from 1 to $d$. We show that our proposed algorithm gives an accurate estimate and scales well with the input size. We provide error bounds and time and space costs, and conduct an extensive experimental evaluation of our algorithm, comparing its estimation accuracy to a few competitors, including some multi-pass algorithms. Our results show that, given the same space, the proposed algorithm has an order of magnitude less error for a large range of similarity thresholds.
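To make the estimated quantity concrete, the following minimal sketch computes a similarity self-join size exactly by brute force. It is not the algorithm proposed in the paper. It assumes, for illustration only, that the similarity of two $d$-ary records is the number of attribute positions on which they agree, and that a pair qualifies when this count is at least a threshold `s`. The function names and sample records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def similarity(r1, r2):
    """Assumed measure: number of attribute positions where two records agree."""
    return sum(a == b for a, b in zip(r1, r2))


def exact_similarity_self_join_size(records, s):
    """Count unordered pairs of distinct records that agree on at least s attributes."""
    return sum(
        1 for r1, r2 in combinations(records, 2) if similarity(r1, r2) >= s
    )


def unary_self_join_size(values):
    """d = 1 special case: count pairs of equal values from their frequencies."""
    freq = Counter(values)
    return sum(f * (f - 1) // 2 for f in freq.values())


# Hypothetical 3-ary records (name, city, age).
records = [
    ("ann", "nyc", 30),
    ("ann", "nyc", 31),
    ("ann", "sfo", 30),
    ("bob", "sfo", 25),
]

# Exact sizes for each threshold s = 1..d.
sizes = {s: exact_similarity_self_join_size(records, s) for s in (1, 2, 3)}
print(sizes)  # {1: 4, 2: 2, 3: 0}

# On unary input the similarity count reduces to the ordinary self-join size.
values = ["a", "b", "a", "a", "c", "b"]
unary_records = [(v,) for v in values]
print(exact_similarity_self_join_size(unary_records, 1), unary_self_join_size(values))
```

This baseline stores every record and compares all pairs, so it needs space linear in the input and quadratic time. A streaming estimator of the kind the abstract describes aims to approximate the same counts in one scan with sublinear space.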
