Proceedings of the First International Workshop on Keyword Search on Structured Data

Efficient top-k algorithms for fuzzy search in string collections

Abstract

An approximate search query on a collection of strings finds those strings in the collection that are similar to a given query string, where similarity is defined using a given similarity function such as Jaccard, cosine, or edit distance. Answering approximate queries efficiently is important in many applications such as search engines, data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as in the data. In this paper, we study the problem of efficiently computing the best answers to an approximate string query, where the quality of a string is based on both its importance and its similarity to the query string. We first develop a progressive algorithm that answers a ranking query by using the results of several approximate range queries, leveraging existing search techniques. We then develop efficient algorithms for answering ranking queries using indexing structures of gram-based inverted lists. We answer a ranking query by traversing the inverted lists, pruning and skipping irrelevant string ids, iteratively increasing the pruning and skipping power, and terminating early. We have conducted extensive experiments on real datasets to evaluate the proposed algorithms and report our findings.
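To make the progressive idea concrete, the following is a minimal sketch, not the authors' implementation. It answers a top-k query by issuing edit-distance range queries with a growing threshold over a gram-based inverted index, and stops once no unseen string can enter the top k. Several pieces are assumptions made for illustration only: the scoring function `weight - penalty * distance`, the parameter names `penalty` and `max_tau`, the choice of 2-grams, and the use of the standard count filter (strings within edit distance tau of the query share at least len(query) - q + 1 - q * tau grams with it).

```python
import heapq
from collections import Counter, defaultdict

def grams(s, q=2):
    """Overlapping q-grams of s, without padding."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def edit_distance(a, b):
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build_index(strings, q=2):
    """Inverted lists: gram -> {string id: occurrences of the gram in that string}."""
    index = defaultdict(dict)
    for sid, s in enumerate(strings):
        for g, c in Counter(grams(s, q)).items():
            index[g][sid] = c
    return index

def range_query(query, tau, strings, index, q=2):
    """Return {string id: distance} for all strings within edit distance tau."""
    need = len(query) - q + 1 - q * tau  # count-filter lower bound on shared grams
    if need <= 0:
        candidates = range(len(strings))  # the filter cannot prune; verify everything
    else:
        shared = defaultdict(int)
        for g, qc in Counter(grams(query, q)).items():
            for sid, sc in index.get(g, {}).items():
                shared[sid] += min(qc, sc)
        candidates = [sid for sid, c in shared.items() if c >= need]
    result = {}
    for sid in candidates:
        d = edit_distance(query, strings[sid])
        if d <= tau:
            result[sid] = d
    return result

def top_k(query, k, strings, weights, index, q=2, penalty=0.3, max_tau=5):
    """Progressive top-k under the ASSUMED score: weight - penalty * distance."""
    max_w = max(weights)
    best = {}
    for tau in range(max_tau + 1):
        for sid, d in range_query(query, tau, strings, index, q).items():
            best[sid] = weights[sid] - penalty * d
        ranked = heapq.nlargest(k, best.items(), key=lambda kv: kv[1])
        # Strings not returned so far have distance > tau, so they score at most this.
        bound = max_w - penalty * (tau + 1)
        if len(ranked) == k and ranked[-1][1] >= bound:
            break  # early termination: the current top k cannot change
    return [(strings[sid], score) for sid, score in ranked]

# Hypothetical data: strings with importance weights in [0, 1].
strings = ["apple", "apply", "ample", "maple", "banana"]
weights = [0.9, 0.5, 0.4, 0.8, 1.0]
index = build_index(strings)
print(top_k("appel", 1, strings, weights, index))
```

The early-termination test relies on the assumed score decreasing as the distance grows, which is what lets a bound on unseen strings be derived from the current threshold. A different scoring function would need its own bound.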
