Conference: 2011 IEEE 27th International Conference on Data Engineering

Fast-join: An efficient method for fuzzy token matching based string similarity join

Abstract

String similarity join, which finds similar string pairs between two string sets, is an essential operation in many applications and has recently attracted significant attention in the database community. A significant challenge in similarity join is implementing an effective fuzzy match operation that finds all similar string pairs, including those that do not match exactly. In this paper, we propose a new similarity metric, called “fuzzy token matching based similarity”, which extends token-based similarity functions (e.g., Jaccard similarity and Cosine similarity) by allowing fuzzy matches between tokens. We study the problem of similarity join using this new similarity metric and present a signature-based method to address it. We propose new signature schemes and develop effective pruning techniques to improve performance. Experimental results show that our approach achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.
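
To make the idea of a fuzzy token match concrete, the following is a minimal sketch of how Jaccard similarity might be relaxed so that near-identical tokens still count toward the overlap. The abstract does not give the formal definition, so everything here is an assumption made for illustration: tokens are split on whitespace, two tokens are compared with normalized edit similarity, pairs scoring below a threshold `delta` contribute nothing, and the overlap is the weight of the best one-to-one matching between the two token lists. The function names (`token_similarity`, `fuzzy_overlap`, `fuzzy_jaccard`) are hypothetical, and the sketch does not cover the signature schemes or pruning techniques.

```python
from functools import lru_cache


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def token_similarity(a: str, b: str) -> float:
    """Normalized edit similarity in [0, 1]; 1.0 means identical tokens."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def fuzzy_overlap(tokens_a, tokens_b, delta: float = 0.8) -> float:
    """Weight of the best one-to-one token matching (assumed definition).

    Token pairs with similarity below delta get weight 0. The search is an
    exact bitmask DP, exponential in the shorter list, so it is only meant
    for short strings.
    """
    a, b = list(tokens_a), list(tokens_b)
    if len(b) > len(a):
        a, b = b, a  # keep the bitmask over the shorter side
    weights = []
    for x in a:
        row = []
        for y in b:
            sim = token_similarity(x, y)
            row.append(sim if sim >= delta else 0.0)
        weights.append(row)

    @lru_cache(maxsize=None)
    def best(i: int, used: int) -> float:
        if i == len(a):
            return 0.0
        result = best(i + 1, used)  # option: leave a[i] unmatched
        for j in range(len(b)):
            if not used & (1 << j) and weights[i][j] > 0.0:
                result = max(result, weights[i][j] + best(i + 1, used | (1 << j)))
        return result

    return best(0, 0)


def fuzzy_jaccard(s1: str, s2: str, delta: float = 0.8) -> float:
    """Jaccard-style ratio with the fuzzy overlap in place of the exact one."""
    t1, t2 = s1.split(), s2.split()
    if not t1 and not t2:
        return 1.0
    overlap = fuzzy_overlap(t1, t2, delta)
    return overlap / (len(t1) + len(t2) - overlap)
```

Under these assumptions, `fuzzy_jaccard("nba mcgrady", "nba macgrady")` comes out at about 0.88, whereas exact Jaccard over the same token sets is 1/3, because the one-character difference between "mcgrady" and "macgrady" no longer counts as a complete mismatch.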
