Refining high-frequency-queries-based filter for similarity join

Abstract

Similarity join and similarity search are important operations for text databases and data cleaning. The filter-and-verification approach is applied to reduce their processing time. A high-frequency-queries-based filter partitions a dataset according to the similarity between each data record and a chosen high-frequency query, and stores these partitions in a similarity table. In the filter step, the data in some rows of a similarity table are selected as candidates. Many high-frequency queries can be used to improve the pruning power. However, the time to choose an appropriate high-frequency query, i.e., to choose an appropriate similarity table, increases with the number of high-frequency queries. This paper proposes a refinement of the high-frequency-queries-based filter that reduces both the filter time and the number of candidates. To reduce the filter time, inverted lists of the high-frequency queries are used to speed up token counting, which reduces the time for choosing an appropriate similarity table. Binary search within each row of a similarity table is applied to further eliminate non-candidates. Experiments show that the refined filter takes less time and has better pruning power than the original method.
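The two refinements named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how a similarity table is chosen from the token counts, which similarity function is used, or how rows are ordered. The sketch assumes that the high-frequency query sharing the most tokens with the incoming query is chosen, that similarity is Jaccard, and that each row is kept sorted by record size so that a standard size bound can be applied by binary search. The names `build_inverted_lists`, `choose_hf_query`, and `prune_row_by_size` are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
import math


def build_inverted_lists(hf_queries):
    """Map each token to the ids of the high-frequency queries that contain it."""
    inverted = defaultdict(list)
    for qid, tokens in enumerate(hf_queries):
        for token in set(tokens):
            inverted[token].append(qid)
    return inverted


def choose_hf_query(inverted, query_tokens):
    """Count shared tokens per high-frequency query by scanning inverted lists.

    Only the lists of tokens that occur in the incoming query are visited, so
    high-frequency queries sharing no token are never touched.
    Returns (qid, shared_count), or (None, 0) if nothing overlaps.
    Assumption: the query with the largest overlap is the one to use;
    ties go to the smallest id.
    """
    counts = defaultdict(int)
    for token in set(query_tokens):
        for qid in inverted.get(token, ()):
            counts[qid] += 1
    if not counts:
        return None, 0
    best = max(counts, key=lambda qid: (counts[qid], -qid))
    return best, counts[best]


def prune_row_by_size(sizes, ids, query_size, threshold):
    """Binary-search one row for records whose size admits Jaccard >= threshold.

    Assumption: `sizes` is sorted ascending and parallel to `ids`.
    For Jaccard similarity, a record of size s can reach the threshold t only if
    t * |q| <= s <= |q| / t, so everything outside that range is a non-candidate.
    """
    lo_size = math.ceil(threshold * query_size)
    hi_size = math.floor(query_size / threshold)
    start = bisect_left(sizes, lo_size)
    end = bisect_right(sizes, hi_size)
    return ids[start:end]
```

With this layout, choosing a table costs time proportional to the lengths of the inverted lists touched by the incoming query instead of a comparison against every high-frequency query, and each selected row is narrowed with two binary searches instead of a full scan.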

Bibliographic details

Source
International computer science and engineering conference, 2015, pp. 1-5 (5 pages)
Authors
Jaruloj Chongstitvatana; Natthee Thitinanrungkit;
Original format: PDF
Keywords
filter-and-verification approach; high-frequency queries; similarity join

Similar literature

1. Generalizing prefix filtering to improve set similarity joins [J]. Leonardo Andrade Ribeiro, Theo Haerder. Information Systems. 2011, No. 1
2. Bitmap filter: Speeding up exact set similarity joins with bitwise operations [J]. Sandes Edans F. O., Teodoro George L. M., Melo Alba C. M. A. Information Systems. 2020, No. Feba
3. A Prefix-Filter based Method for Spatio-Textual Similarity Join [J]. Liu S., Li G., Feng J. IEEE Transactions on Knowledge and Data Engineering. 2014, No. 10
4. Refining high-frequency-queries-based filter for similarity join [C]. Jaruloj Chongstitvatana, Natthee Thitinanrungkit. International computer science and engineering conference. 2015
5. String Similarity Joins and Search Under Edit Distance [D]. Zhang, Haoyu. 2020
6. Assessment of structural similarity in CT using filtered backprojection and iterative reconstruction: a phantom study with 3D printed lung vessels [O]. Raoul M. S. Joemai, Jacob Geleijns. 2017
7. LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set Similarity Under Skew [O]. Cyrus Rashtchian, Aneesh Sharma, David Woodruff. 2020
