Venue: International computer science and engineering conference

Refining high-frequency-queries-based filter for similarity join

Abstract

Similarity join and similarity search are important for text databases and data cleaning. Filter-and-verification methods are applied to reduce the processing time of both operations. A high-frequency-queries-based filter partitions a dataset according to the similarity between each data record and a chosen high-frequency query, and stores these partitions in a similarity table. In the filter step, the data in some rows of a similarity table are selected as candidates. Using many high-frequency queries can improve the pruning power. However, the time needed to choose an appropriate high-frequency query (that is, an appropriate similarity table) increases with the number of high-frequency queries. This paper proposes a refinement of the high-frequency-queries-based filter that reduces both the filter time and the number of candidates. To reduce the filter time, inverted lists of the high-frequency queries are used to speed up token counting, which shortens the time needed to choose an appropriate similarity table. Binary search within each row of a similarity table is then applied to eliminate further non-candidates. Experiments show that the refined filter takes less time and has better pruning power than the original method.
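The two refinements named in the abstract can be illustrated with a minimal sketch. The abstract does not give the similarity function, the row layout, or the sort key used for the binary search, so everything below is an assumption made for illustration: Jaccard similarity over token sets, rows stored as lists of record sizes sorted in ascending order, and the standard Jaccard length filter (a record r can match a query q at threshold t only if t·|q| ≤ |r| ≤ |q|/t). The function names are hypothetical and are not taken from the paper.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_inverted_lists(hf_queries):
    """Map each token to the ids of the high-frequency queries containing it."""
    inverted = defaultdict(list)
    for qid, tokens in enumerate(hf_queries):
        for tok in set(tokens):
            inverted[tok].append(qid)
    return inverted


def overlap_counts(query_tokens, inverted, n_queries):
    """Count tokens shared between the query and every high-frequency query.

    Only the inverted lists of the query's own tokens are scanned, so
    high-frequency queries sharing no token are never touched.
    """
    counts = [0] * n_queries
    for tok in set(query_tokens):
        for qid in inverted.get(tok, ()):
            counts[qid] += 1
    return counts


def best_hf_query(query_tokens, hf_queries, inverted):
    """Pick the high-frequency query most similar to the query (assumed: Jaccard)."""
    q = set(query_tokens)
    counts = overlap_counts(q, inverted, len(hf_queries))

    def jaccard(qid):
        union = len(q) + len(set(hf_queries[qid])) - counts[qid]
        return counts[qid] / union if union else 0.0

    return max(range(len(hf_queries)), key=jaccard)


def prune_row_by_size(row_sizes, query_size, threshold):
    """Binary-search a size-sorted row for the slice [lo, hi) passing the length filter.

    Assumption: each row keeps its records sorted by token-set size.
    """
    lo = bisect_left(row_sizes, threshold * query_size)
    hi = bisect_right(row_sizes, query_size / threshold)
    return lo, hi


hf_queries = [
    ["data", "cleaning", "text"],
    ["similarity", "join", "filter"],
    ["similarity", "search"],
]
inverted = build_inverted_lists(hf_queries)
chosen = best_hf_query(["similarity", "join", "text"], hf_queries, inverted)
window = prune_row_by_size([1, 2, 2, 3, 5, 8, 9], query_size=4, threshold=0.5)
```

Under these assumptions, choosing a similarity table costs time proportional to the lengths of the inverted lists that are scanned rather than to the total number of high-frequency queries, and each selected row is narrowed with two binary searches instead of a full scan.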
