首页> 外文会议>IEEE International Conference on Data Engineering >Top-k string similarity search with edit-distance constraints
【24h】

Top-k string similarity search with edit-distance constraints

机译:与编辑距离约束的Top-K字符串相似性搜索

获取原文

摘要

String similarity search is a fundamental operation in many areas, such as data cleaning, information retrieval, and bioinformatics. In this paper we study the problem of top-k string similarity search with edit-distance constraints, which, given a collection of strings and a query string, returns the top-k strings with the smallest edit distances to the query string. Existing methods usually try different edit-distance thresholds and select an appropriate threshold to find top-k answers. However it is rather expensive to select an appropriate threshold. To address this problem, we propose a progressive framework by improving the traditional dynamic-programming algorithm to compute edit distance. We prune unnecessary entries in the dynamic-programming matrix and only compute those pivotal entries. We extend our techniques to support top-k similarity search. We develop a range-based method by grouping the pivotal entries to avoid duplicated computations. Experimental results show that our method achieves high performance, and significantly outperforms state-of-the-art approaches on real-world datasets.
机译:字符串相似性搜索是许多领域的基本操作,例如数据清洁,信息检索和生物信息学。在本文中,我们研究了与编辑距离约束的top-k字符串相似性搜索问题,因为字符串和查询字符串的集合,返回与查询字符串的最小编辑距离的顶-k字符串。现有方法通常会尝试不同的编辑距离阈值,并选择适当的阈值以查找顶部K答案。然而,选择适当的阈值是相当昂贵的。为了解决这个问题,我们通过改进传统的动态编程算法来计算编辑距离来提出渐进框架。我们在动态编程矩阵中修剪不必要的条目,并且仅计算这些关键条目。我们扩展了我们的技术来支持Top-K类似性搜索。通过对关键条目进行分组以避免重复计算,我们开发基于范围的方法。实验结果表明,我们的方法实现了高性能,并且显着优于现实世界数据集的最先进方法。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号