首页> 外文期刊>Information Systems >An effective candidate generation method for improving performance of edit similarity query processing
【24h】

An effective candidate generation method for improving performance of edit similarity query processing

机译:一种提高编辑相似度查询处理性能的有效候选生成方法

获取原文
获取原文并翻译 | 示例
       

摘要

In this paper, we study edit similarity query processing to find strings similar to a query string from a collection of strings. To solve the problem, many algorithms have been proposed under a filter-and-verification framework, where candidate strings are generated and refined using a few filters and then verified to find true matches. A major focus of those algorithms has been on generating candidates as small as possible in an early stage of the query processing. A typical approach to generate candidates is to extract some signatures from a query and take union of string ids in the inverted lists of the extracted signatures. However, the number of candidates generated from existing techniques is extremely larger than the number of answer strings and costs for refinement and verification are expensive. To address the problem, we propose an intersection-based candidate generation scheme, which generates a substantially smaller number of candidates. Given some signatures of a query, the proposed scheme first categorizes signatures into several groups. Then, it takes intersection of string ids in the inverted lists of the signatures in each group. Finally, it takes union of the intersections to generate candidates. To minimize the number of candidates under our scheme, we propose a novel algorithm which judiciously selects an optimal signature group. We show through experiments that our technique is very effective in reducing the number of candidates and significantly improves the performance.
机译:在本文中,我们研究了编辑相似性查询处理,以从字符串集合中查找与查询字符串相似的字符串。为了解决该问题,已经在过滤和验证框架下提出了许多算法,其中使用一些过滤器生成和精炼候选字符串,然后对其进行验证以找到真正的匹配项。这些算法的主要重点是在查询处理的早期阶段生成尽可能小的候选对象。生成候选者的典型方法是从查询中提取一些签名,并在提取的签名的反向列表中采用字符串ID的并集。然而,从现有技术产生的候选者的数量远大于答案串的数量,并且精炼和验证的成本昂贵。为了解决该问题,我们提出了一种基于交集的候选生成方案,该方案生成的候选数量要少得多。给定查询的某些签名,建议的方案首先将签名分类为几个组。然后,它在每个组的签名倒排列表中获取字符串ID的交集。最后,需要交集并集才能生成候选。为了在我们的方案下最大程度地减少候选者的数量,我们提出了一种新颖的算法,该算法会明智地选择最佳签名组。我们通过实验表明,我们的技术在减少候选人数量方面非常有效,并且可以显着提高性能。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号