Venue: International Conference on Cloud Computing and Security

MLS-Join: An Efficient MapReduce-Based Algorithm for String Similarity Self-joins with Edit Distance Constraint

Abstract

String similarity join is an essential operation in data integration, and the era of big data calls for scalable algorithms that support it at large scale. In this paper, we study scalable string similarity self-joins with an edit distance constraint and propose a MapReduce-based algorithm, called MLS-Join, to support them. The proposed algorithm follows the filter-verify approach. In the filter stage, the existing multi-match-aware substring selection scheme is improved to decrease the number of generated signatures and to eliminate redundant string pairs, including self-to-self pairs and duplicate pairs. In the verify stage, the dataset is read only once by using the techniques of positive/reversed pairs and a combined key. Experimental results on real-world datasets show that our algorithm significantly outperforms state-of-the-art approaches.
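
To make the filter-verify idea concrete, the following is a minimal single-machine sketch in Python. It is not the MLS-Join algorithm: it does not use MapReduce or the multi-match-aware substring signatures, and it swaps in a plain length filter for the filter stage. The function names `edit_distance_within` and `similarity_self_join` are hypothetical, chosen only for illustration. It shows how a self-join can avoid self-to-self and duplicate pairs by emitting each unordered pair of record ids once.

```python
def edit_distance_within(a: str, b: str, tau: int) -> bool:
    """Verify step: True if the Levenshtein distance of a and b is <= tau."""
    # Length filter: the distance is at least the length difference.
    if abs(len(a) - len(b)) > tau:
        return False
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        # Row minima never decrease, so the final distance must exceed tau.
        if min(curr) > tau:
            return False
        prev = curr
    return prev[-1] <= tau


def similarity_self_join(strings, tau):
    """Return sorted id pairs (i, j), i < j, whose strings are within tau edits."""
    # Sort record ids by string length so the length filter can stop a scan early.
    order = sorted(range(len(strings)), key=lambda k: len(strings[k]))
    pairs = []
    for pos, i in enumerate(order):
        # Only look at records later in the order: no self pairs, no duplicates.
        for j in order[pos + 1:]:
            if len(strings[j]) - len(strings[i]) > tau:
                break  # every remaining candidate is at least this long
            if edit_distance_within(strings[i], strings[j], tau):
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


if __name__ == "__main__":
    data = ["kitten", "sitten", "sitting", "mitten", "dog"]
    print(similarity_self_join(data, 1))
```

Identifying records by position rather than by string value keeps identical strings as distinct records, so two equal strings still form one result pair with distance 0.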
