Conference: IEEE International Conference on Data Engineering

MassJoin: A MapReduce-based method for scalable string similarity joins



Abstract

String similarity join is an essential operation in data integration. The era of big data calls for scalable algorithms to support large-scale string similarity joins. In this paper, we study scalable string similarity joins using MapReduce. We propose a MapReduce-based framework, called MASSJOIN, which supports both set-based and character-based similarity functions. We extend the existing partition-based signature scheme to support set-based similarity functions, and we use the signatures to generate key-value pairs. To reduce the transmission cost, we merge key-value pairs, which cuts their number from cubic to linear complexity without sacrificing pruning power. To improve performance, we incorporate “light-weight” filter units into the key-value pairs; these units prune a large number of dissimilar pairs without significantly increasing the transmission cost. Experimental results on real-world datasets show that our method significantly outperforms state-of-the-art approaches.
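To make the idea of signature-based key-value pairs concrete, the following is a minimal in-process sketch of a partition-based signature join for edit distance, written as map, shuffle, and reduce steps. It is not the MASSJOIN implementation: it omits the key-value merging and the light-weight filter units mentioned in the abstract, and it covers only the character-based case. All function names (`segments`, `map_string`, `similarity_join`) and the key layout are assumptions chosen for illustration. The sketch relies on the pigeonhole principle: if a string is split into tau + 1 segments and lies within edit distance tau of another string, at least one segment must appear unchanged in the other string, shifted by at most tau positions.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def segments(length, tau):
    """Evenly split a string of the given length into tau + 1 (start, size) segments."""
    k = tau + 1
    base, extra = divmod(length, k)
    out, pos = [], 0
    for i in range(k):
        size = base + (1 if i >= k - extra else 0)
        out.append((pos, size))
        pos += size
    return out


def map_string(sid, s, tau):
    """Map step (hypothetical layout): key = (text, segment index, indexed length)."""
    # Index side: emit this string's own segments as signatures.
    for i, (pos, size) in enumerate(segments(len(s), tau)):
        yield (s[pos:pos + size], i, len(s)), ("index", sid)
    # Probe side: emit substrings that could match a segment of any string
    # whose length is between len(s) - tau and len(s).
    for length in range(max(0, len(s) - tau), len(s) + 1):
        for i, (pos, size) in enumerate(segments(length, tau)):
            lo = max(0, pos - tau)
            hi = min(len(s) - size, pos + tau)
            for start in range(lo, hi + 1):
                yield (s[start:start + size], i, length), ("probe", sid)


def similarity_join(strings, tau):
    """Return sorted id pairs (i, j), i < j, with edit distance <= tau."""
    # Shuffle step: group values by key, as a MapReduce framework would.
    groups = defaultdict(list)
    for sid, s in enumerate(strings):
        for key, value in map_string(sid, s, tau):
            groups[key].append(value)
    # Reduce step: pair index ids with probe ids that share a key.
    candidates = set()
    for values in groups.values():
        index_ids = {sid for role, sid in values if role == "index"}
        probe_ids = {sid for role, sid in values if role == "probe"}
        for r in index_ids:
            for p in probe_ids:
                if r != p:
                    candidates.add((min(r, p), max(r, p)))
    # Verification step: keep only candidates that are truly similar.
    return sorted(
        pair for pair in candidates
        if edit_distance(strings[pair[0]], strings[pair[1]]) <= tau
    )


strings = ["mapreduce", "mapreduse", "mapreduces", "hadoop", "hadop", "spark"]
result = similarity_join(strings, tau=1)
```

In this sketch every probe substring becomes its own key-value pair, so the volume of emitted pairs grows quickly with string length and threshold. That overhead is the kind of transmission cost the abstract says the merging step is meant to reduce.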
