Journal: IEEE Transactions on Knowledge and Data Engineering

LS-Join: Local Similarity Join on String Collections



Abstract

String similarity join, as an essential operation in applications including data integration and data cleaning, has attracted significant attention in the research community. Previous studies focus on global similarity join. In this paper, we study local similarity join with edit distance constraints, which finds string pairs from two string collections that have similar substrings. We study two kinds of local similarity join problems: checking locally similar pairs and locating locally similar pairs. We first consider the case where two strings that are locally similar to each other must share a common gram of a certain length. We show how to do efficient local similarity verification based on a matching gram pair. We propose two pruning techniques and an incremental method to further improve the efficiency of finding matching gram pairs. Then, we devise a method to locate the longest similar substring pair for two locally similar strings. We conducted a comprehensive experimental study to evaluate the efficiency of these techniques.
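The shared-gram idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it contains none of the pruning techniques or the incremental method, and the verification step is plain brute force. It assumes, for illustration only, that two strings r and s count as locally similar when some substring of r of length at least `min_len` is within edit distance `tau` of some substring of s, with `min_len > tau`; the paper's exact definition may differ. Under that assumption, cutting a length-`min_len` substring into `tau + 1` pieces leaves at least one piece untouched by `tau` edits, so the two strings must share a gram of length `min_len // (tau + 1)`. All function and parameter names here are hypothetical.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def matching_gram_pairs(r, s, q):
    """Return all (i, j) with r[i:i+q] == s[j:j+q], using a gram index on r."""
    index = defaultdict(list)
    for i in range(len(r) - q + 1):
        index[r[i:i + q]].append(i)
    pairs = []
    for j in range(len(s) - q + 1):
        for i in index.get(s[j:j + q], ()):
            pairs.append((i, j))
    return pairs


def locally_similar(r, s, min_len, tau):
    """Brute-force check of the assumed definition, with a shared-gram filter."""
    q = min_len // (tau + 1)
    # Filter: with no shared gram of length q, the pair cannot be locally similar.
    if q >= 1 and not matching_gram_pairs(r, s, q):
        return False
    # Verification: it suffices to test substrings of r of length exactly min_len
    # against substrings of s whose length differs by at most tau.
    for i in range(len(r) - min_len + 1):
        x = r[i:i + min_len]
        for length in range(max(1, min_len - tau), min_len + tau + 1):
            for j in range(len(s) - length + 1):
                if edit_distance(x, s[j:j + length]) <= tau:
                    return True
    return False
```

For example, `locally_similar("xxxxkittenyyyy", "aaasittenbbb", 6, 1)` passes the filter because both strings contain the gram "itt", and the verification then finds "kitten" and "sitten" at edit distance 1. The brute-force verification is only there to make the filter's role visible; the abstract's contribution is doing that verification efficiently from a matching gram pair.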
