首页> 外文会议>IEEE international conference on data engineering >Two birds with one stone: An efficient hierarchical framework for top-k and threshold-based string similarity search
【24h】

Two birds with one stone: An efficient hierarchical framework for top-k and threshold-based string similarity search

机译:两只鸟一块石头:高效的基于阈值和基于阈值的字符串相似性搜索的高效分层框架

获取原文
获取外文期刊封面目录资料

摘要

String similarity search is a fundamental operation in data cleaning and integration. It has two variants, threshold-based string similarity search and top-k string similarity search. Existing algorithms are efficient either for the former or the latter; most of them can't support both two variants. To address this limitation, we propose a unified framework. We first recursively partition strings into disjoint segments and build a hierarchical segment tree index (HS-Tree) on top of the segments. Then we utilize the HS-Tree to support similarity search. For threshold-based search, we identify appropriate tree nodes based on the threshold to answer the query and devise an efficient algorithm (HS-Search). For top-k search, we identify promising strings with large possibility to be similar to the query, utilize these strings to estimate an upper bound which is used to prune dissimilar strings, and propose an algorithm (HS-Topk). We also develop effective pruning techniques to further improve the performance. Experimental results on real-world datasets show our method achieves high performance on the two problems and significantly outperforms state-of-the-art algorithms.
机译:字符串相似性搜索是数据清洁和集成的基本操作。它有两个变体,基于阈值的字符串相似性搜索和Top-K字符串相似性搜索。现有算法对于前者或后者效率为高;其中大多数都无法支持两个变体。为了解决这个限制,我们提出了一个统一的框架。我们首先将字符串递归分区为脱节段,并在段的顶部构建分层段树索引(HS树)。然后我们利用HS树来支持相似性搜索。对于基于阈值的搜索,我们根据阈值识别适当的树节点以应答查询并设计有效的算法(HS-Search)。对于Top-K搜索,我们确定有可能与查询类似的有可能性的承诺字符串,利用这些字符串来估计用于修剪异常字符串的上限,并提出一种算法(HS-POPK)。我们还开发了有效的修剪技术,以进一步提高性能。实验结果对现实世界数据集显示我们的方法在两个问题上实现了高性能,并且显着优于最先进的算法。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号