Conference: IEEE International Conference on Data Engineering
Scalable Metric Similarity Join Using MapReduce

Abstract

Given two collections of objects, a metric similarity join finds all pairs of similar objects according to a given distance function in metric space. There is increasing demand for a scalable similarity join algorithm that can support efficient query and analytical services in the era of Big Data. In this paper, we propose SMS-Join, a parallel framework that supports similarity join in metric space based on the MapReduce paradigm. SMS-Join first selects some records as pivots in a preprocessing phase, then splits the data into partitions based on those pivots with a map job, and finally obtains the join results via a reduce job. To ensure load balancing between the partitions, we devise a lightweight sampling technique that obtains high-quality samples while maintaining high performance. To reduce the partition cost, we develop an iterative partition strategy in the map phase. We implement our framework on the Apache Spark platform and conduct extensive experiments on four real-world datasets. The results show that our method significantly outperforms state-of-the-art methods.
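
The pivot, partition, and join workflow described in the abstract can be illustrated with a small single-machine sketch. This is not the SMS-Join algorithm itself: the abstract does not give its sampling or iterative partitioning details, so the sketch assumes the pivots are supplied by the caller, uses Euclidean distance as the metric, and substitutes a plain triangle-inequality filter for the paper's partitioning strategy. The function names `nearest_pivot` and `pivot_similarity_join` are hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance; assumed here, any metric would do


def nearest_pivot(point, pivots):
    """Return the index of the pivot closest to `point`."""
    return min(range(len(pivots)), key=lambda i: dist(point, pivots[i]))


def pivot_similarity_join(r_items, s_items, pivots, delta):
    """Return all pairs (r, s) with dist(r, s) <= delta.

    Illustrative only: pivots are given, not sampled, and the
    "map" and "reduce" steps run sequentially in one process.
    """
    # "Map" step for R: send each record to the partition of its nearest
    # pivot and track the largest pivot-to-record distance per partition.
    r_parts = defaultdict(list)
    radius = defaultdict(float)
    for r in r_items:
        i = nearest_pivot(r, pivots)
        r_parts[i].append(r)
        radius[i] = max(radius[i], dist(r, pivots[i]))

    # "Map" step for S: replicate s to every partition that could hold a
    # match. By the triangle inequality, if dist(r, s) <= delta and r is in
    # partition i, then dist(s, pivot_i) <= radius_i + delta.
    s_parts = defaultdict(list)
    for s in s_items:
        for i in r_parts:
            if dist(s, pivots[i]) <= radius[i] + delta:
                s_parts[i].append(s)

    # "Reduce" step: verify candidate pairs inside each partition.
    results = []
    for i, rs in r_parts.items():
        for r in rs:
            for s in s_parts[i]:
                if dist(r, s) <= delta:
                    results.append((r, s))
    return results


R = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
S = [(0.5, 0.5), (5.2, 5.1), (20.0, 20.0)]
pairs = pivot_similarity_join(R, S, pivots=[(0.0, 0.0), (9.0, 9.0)], delta=1.0)
```

Because each record of the first collection lands in exactly one partition, every qualifying pair is reported once. The filter on the second collection only prunes partitions that cannot contain a match, so the result equals that of a brute-force comparison.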
