MassJoin: A mapreduce-based method for scalable string similarity joins

Abstract

String similarity join is an essential operation in data integration. The era of big data calls for scalable algorithms to support large-scale string similarity joins. In this paper, we study scalable string similarity joins using MapReduce. We propose a MapReduce-based framework, called MASSJOIN, which supports both set-based similarity functions and character-based similarity functions. We extend the existing partition-based signature scheme to support set-based similarity functions. We utilize the signatures to generate key-value pairs. To reduce the transmission cost, we merge key-value pairs, which reduces their number from cubic to linear complexity without sacrificing pruning power. To improve the performance, we incorporate "light-weight" filter units into the key-value pairs; these units can be used to prune a large number of dissimilar pairs without significantly increasing the transmission cost. Experimental results on real-world datasets show that our method significantly outperforms state-of-the-art approaches.
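To make the signature-to-key-value idea concrete, here is a minimal single-machine sketch in Python. It is not MASSJOIN's actual scheme: it assumes an edit-distance threshold `tau`, uses the pigeonhole argument behind partition-based signatures (if two strings are within `tau` edits, at least one of the `tau + 1` segments of one string occurs as a substring of the other), and simulates the map and reduce phases with plain functions. All function names are hypothetical. The probe side naively emits every substring, and the sketch omits the key-value merging and light-weight filter units described in the abstract.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def segments(s: str, tau: int) -> list[str]:
    """Split s into tau + 1 contiguous segments of near-equal length."""
    n_seg = tau + 1
    base, extra = divmod(len(s), n_seg)
    out, pos = [], 0
    for k in range(n_seg):
        # The last `extra` segments get one additional character.
        length = base + (1 if k >= n_seg - extra else 0)
        out.append(s[pos:pos + length])
        pos += length
    return out


def substrings(s: str) -> set[str]:
    """All substrings of s, including the empty string (naive probe set)."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def map_phase(records, tau):
    """Emit (signature, (role, record id)) key-value pairs."""
    for rid, s in records:
        for seg in set(segments(s, tau)):
            yield seg, ("seg", rid)
        for sub in substrings(s):
            yield sub, ("sub", rid)


def reduce_phase(pairs):
    """Group by signature; pair every segment owner with every substring owner."""
    groups = defaultdict(lambda: {"seg": set(), "sub": set()})
    for key, (role, rid) in pairs:
        groups[key][role].add(rid)
    candidates = set()
    for group in groups.values():
        for r in group["seg"]:
            for s in group["sub"]:
                if r != s:
                    candidates.add((min(r, s), max(r, s)))
    return candidates


def similarity_self_join(strings, tau):
    """Return sorted index pairs (i, j), i < j, with edit distance <= tau."""
    records = list(enumerate(strings))
    candidates = reduce_phase(map_phase(records, tau))
    return sorted((i, j) for i, j in candidates
                  if edit_distance(strings[i], strings[j]) <= tau)


titles = ["vldb", "pvldb", "icde", "sigmod", "sigmmod"]
print(similarity_self_join(titles, tau=1))  # [(0, 1), (3, 4)]
```

Only pairs that share a signature reach the verification step, which is where the pruning comes from. Emitting every substring keeps the sketch short but is far more than a real system would transmit, which is the cost the abstract's merging step is meant to address.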

Bibliographic details

Source: IEEE International Conference on Data Engineering, 2014, pp. 340-351 (12 pages)
Authors: Deng Dong; Li Guoliang; Hao Shuang; Wang Jiannan
Original format: PDF

Similar literature

1. Para-Join: an efficient parallel method for string similarity join [J]. Cairong Yan, Jian Wang, Bin Zhu. International Journal of High Performance Computing and Networking. 2017, Issue 4a5
2. Trie-join: a trie-based method for efficient string similarity joins [J]. Jianhua Feng, Jiannan Wang, Guoliang Li. The VLDB Journal. 2012, Issue 4
3. Efficient and Scalable Processing of String Similarity Join [J]. Rong, Chuitian, Lu. IEEE Transactions on Knowledge and Data Engineering. 2013, Issue 10
4. MassJoin: A mapreduce-based method for scalable string similarity joins [C]. Deng Dong, Li Guoliang, Hao Shuang. IEEE International Conference on Data Engineering. 2014
5. String Similarity Joins and Search Under Edit Distance [D]. Zhang, Haoyu. 2020
6. Efficient string similarity join in multi-core and distributed systems [O]. Cairong Yan, Xue Zhao, Qinglong Zhang. 2012
7. MassJoin: A MapReduce-based Method for Scalable String Similarity Joins [O]. 2014
8. Similarity analysis applied to the design of scaled tests of hydraulic mitigation methods for Tank 241-SY-101 [R]. Liljegren, L M. 1993
