Published in: International Conference on Cloud Computing and Security

MLS-Join: An Efficient MapReduce-Based Algorithm for String Similarity Self-joins with Edit Distance Constraint


Abstract

The string similarity join is an essential operation in data integration, and the era of big data calls for scalable algorithms that support it at large scale. In this paper, we study scalable string similarity self-joins with an edit distance constraint and propose a MapReduce-based algorithm, called MLS-Join, to support them. The proposed self-join algorithm follows the filter-verify approach. In the filter stage, the existing multi-match-aware substring selection scheme is improved to reduce the number of generated signatures and to eliminate redundant string pairs, including self-to-self pairs and duplicate pairs. In the verify stage, the dataset is read only once by using the techniques of positive/reversed pairs and a combined key. Experimental results on real-world datasets show that our algorithm significantly outperforms state-of-the-art approaches.
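
The filter-verify idea described in the abstract can be illustrated with a minimal single-machine sketch. This is not the MLS-Join algorithm and involves no MapReduce: it assumes the well-known pigeonhole argument (a string split into tau + 1 segments must share at least one segment with any string within edit distance tau) and a simple shift-based substring selection rather than the improved multi-match-aware scheme. The function names `edit_distance`, `segments`, and `similarity_self_join` are hypothetical. Probing the index before inserting each string is one simple way to avoid self-to-self and duplicate pairs.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def segments(length: int, tau: int):
    """Split [0, length) into tau + 1 near-equal (start, size) segments."""
    base, extra = divmod(length, tau + 1)
    out, start = [], 0
    for i in range(tau + 1):
        # The last `extra` segments are one character longer.
        size = base + (1 if i >= tau + 1 - extra else 0)
        out.append((start, size))
        start += size
    return out


def similarity_self_join(strings, tau):
    """Return sorted id pairs (i, j), i < j, with edit distance <= tau."""
    # Visit strings from short to long so each one only probes strings
    # that were indexed before it: no self pairs, no mirrored duplicates.
    order = sorted(range(len(strings)), key=lambda i: (len(strings[i]), i))
    index = defaultdict(list)  # (length, segment number, text) -> ids
    results = []
    for sid in order:
        s = strings[sid]
        candidates = set()  # a set removes repeated hits on the same id
        # Filter: look for indexed segments that occur in s near their
        # original position (shifted by at most tau characters).
        for length in range(max(0, len(s) - tau), len(s) + 1):
            for seg_no, (start, size) in enumerate(segments(length, tau)):
                lo = max(0, start - tau)
                hi = min(len(s) - size, start + tau)
                for pos in range(lo, hi + 1):
                    key = (length, seg_no, s[pos:pos + size])
                    candidates.update(index.get(key, ()))
        # Verify: compute the exact edit distance for surviving candidates.
        for rid in candidates:
            if edit_distance(strings[rid], s) <= tau:
                results.append((min(rid, sid), max(rid, sid)))
        # Index the segments (signatures) of s for later, longer strings.
        for seg_no, (start, size) in enumerate(segments(len(s), tau)):
            index[(len(s), seg_no, s[start:start + size])].append(sid)
    return sorted(results)


if __name__ == "__main__":
    words = ["kitten", "sitten", "sitting", "mitten", "banana"]
    print(similarity_self_join(words, tau=1))
```

In a MapReduce setting, the index keys would plausibly serve as shuffle keys so that strings sharing a signature meet at the same reducer; how MLS-Join arranges this with positive/reversed pairs and a combined key is described in the paper itself and is not reproduced here.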
