首页> 外文会议>IEEE international conference on data engineering >Top-k string similarity search with edit-distance constraints
【24h】

Top-k string similarity search with edit-distance constraints

机译:具有编辑距离约束的top-k字符串相似度搜索

获取原文

摘要

String similarity search is a fundamental operation in many areas, such as data cleaning, information retrieval, and bioinformatics. In this paper we study the problem of top-k string similarity search with edit-distance constraints, which, given a collection of strings and a query string, returns the top-k strings with the smallest edit distances to the query string. Existing methods usually try different edit-distance thresholds and select an appropriate threshold to find top-k answers. However it is rather expensive to select an appropriate threshold. To address this problem, we propose a progressive framework by improving the traditional dynamic-programming algorithm to compute edit distance. We prune unnecessary entries in the dynamic-programming matrix and only compute those pivotal entries. We extend our techniques to support top-k similarity search. We develop a range-based method by grouping the pivotal entries to avoid duplicated computations. Experimental results show that our method achieves high performance, and significantly outperforms state-of-the-art approaches on real-world datasets.
机译:字符串相似性搜索是许多领域的基本操作,例如数据清理,信息检索和生物信息学。在本文中,我们研究了具有编辑距离约束的top-k字符串相似性搜索问题,该问题在给定字符串和查询字符串的集合的情况下,将具有最小编辑距离的top-k字符串返回给查询字符串。现有方法通常尝试不同的编辑距离阈值,然后选择适当的阈值以找到前k个答案。然而,选择适当的阈值相当昂贵。为了解决这个问题,我们提出了一个改进的框架,通过改进传统的动态编程算法来计算编辑距离。我们修剪动态编程矩阵中不必要的条目,仅计算那些关键条目。我们扩展了技术以支持top-k相似性搜索。我们通过对关键条目进行分组来避免重复的计算,从而开发了一种基于范围的方法。实验结果表明,我们的方法具有很高的性能,并且在实际数据集上明显优于最新方法。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号