Similarity Join and Similarity Self-Join Size Estimation in a Streaming Environment

Abstract

We study the problem of similarity self-join and similarity join size estimation in a streaming setting where the goal is to estimate, in one scan of the input and with sublinear space in the input size, the number of record pairs that have a similarity within a given threshold. The problem has many applications in data cleaning and query plan generation, where the cost of a similarity join may be estimated before actually doing the join. On unary input where two records either match or don't match, the problem becomes join and self-join size estimation for which one-pass algorithms are readily available. Our work addresses the problem for $d$-ary input, for $d \geq 1$, where the degree of similarity can vary from 1 to $d$. We show that our proposed algorithm gives an accurate estimate and scales well with the input size. We provide error bounds and time and space costs, and conduct an extensive experimental evaluation of our algorithm, comparing its estimation accuracy to a few competitors, including some multi-pass algorithms. Our results show that given the same space, the proposed algorithm has an order of magnitude less error for a large range of similarity thresholds.
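
To make the estimated quantity concrete, the following is a minimal sketch of the exact (non-streaming) similarity join size, computed by brute force. It is not the paper's algorithm, which works in one pass with sublinear space. Three assumptions are made for illustration only: the similarity of two $d$-ary records is the number of attribute positions on which they agree, a pair qualifies when its similarity is at least the threshold, and a self-join counts unordered pairs of distinct records. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def similarity(r, s):
    """Assumed similarity: number of positions on which two d-ary records agree."""
    if len(r) != len(s):
        raise ValueError("records must have the same arity d")
    return sum(a == b for a, b in zip(r, s))


def similarity_self_join_size(records, threshold):
    """Exact count of unordered pairs with similarity >= threshold.

    Quadratic time and linear space: a ground-truth baseline, not a
    streaming estimator.
    """
    return sum(
        1 for r, s in combinations(records, 2) if similarity(r, s) >= threshold
    )


def similarity_join_size(left, right, threshold):
    """Exact count of (left, right) pairs with similarity >= threshold."""
    return sum(
        1 for r in left for s in right if similarity(r, s) >= threshold
    )


def unary_self_join_size(values):
    """Unary case (d = 1): pairs of equal values, computed from frequencies."""
    counts = Counter(values)
    return sum(f * (f - 1) // 2 for f in counts.values())


if __name__ == "__main__":
    # Four illustrative 3-ary records (made-up data).
    records = [
        ("a", "x", "1"),
        ("a", "x", "2"),
        ("a", "y", "2"),
        ("b", "y", "3"),
    ]
    for t in (1, 2, 3):
        print(t, similarity_self_join_size(records, t))
```

With $d = 1$ the similarity is either 0 or 1, so `similarity_self_join_size` at threshold 1 agrees with `unary_self_join_size`; this is the ordinary self-join size case that the abstract notes is already covered by one-pass algorithms.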

Bibliographic details

Source: IEEE Transactions on Knowledge and Data Engineering, 2020, No. 4, pp. 768-781 (14 pages)
Affiliation: Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2E1, Canada
Original format: PDF
Language: English
Keywords: Selectivity estimation; similarity join; size estimation; one pass algorithm; streaming data

Similar literature

1. Accelerating the similarity self-join using the GPU [J]. Michael Gowanlock, Ben Karsin. Journal of Parallel and Distributed Computing. 2019, issue "Nova"
2. Similarity Based Join Over Audio Feeds in a Multimedia Data Stream Management System [J]. Rafal Maison, Ewelina Majda, Andrzej P. Dobrowolski. Bell Labs Technical Journal. 2013, No. 1
3. Similarity Join Processing on Uncertain Data Streams [J]. Lian Xiang, Chen Lei. IEEE Transactions on Knowledge and Data Engineering. 2011, No. 11
4. Streaming Similarity Self-Join [C]. Aristides Gionis, Aalto University. International Conference on Very Large Data Bases. 2016
5. String Similarity Joins and Search Under Edit Distance [D]. Zhang, Haoyu. 2020
6. Homoplasy corrected estimation of genetic similarity from AFLP bands and the effect of the number of bands on the precision of estimation [O]. Gerrit Gort, Theo van Hintum, Fred van Eeuwijk
7. Streaming Similarity Self-Join [O]. Morales, Gianmarco De Francisci; Gionis, Aristides. 2016
