
A Density-Aware Similarity Join Query Processing Algorithm on MapReduce


Abstract

Data volumes are growing rapidly, and MapReduce has attracted much interest as a paradigm for data-intensive applications. Similarity join is an essential operation in data analytics, with uses including record linkage, near-duplicate detection, and document clustering. However, the performance of MapReduce is limited on complex analytical tasks that involve joining multiple datasets. Workload-aware data partitioning techniques are therefore required to balance the computation assigned to each machine. In this paper, we propose a MapReduce-based similarity join algorithm that provides scalability and high performance by using a grid-based data mapping technique to join datasets. Our experimental analysis shows that the algorithm outperforms the existing algorithm across various data sizes and similarity thresholds.
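The abstract does not spell out the grid-based mapping, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes 2D points, Euclidean distance, a self-join, and a grid whose cell side equals the similarity threshold `eps`; the map and reduce phases are simulated in memory rather than run on Hadoop, and the density-aware load balancing described in the abstract is not modeled. The names `map_phase`, `reduce_phase`, and `similarity_join` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import dist, floor


def map_phase(points, eps):
    """Emit (cell, record) pairs. Assumption: the cell side length equals eps,
    so any two points within eps lie in the same or in adjacent cells."""
    for pid, (x, y) in enumerate(points):
        cx, cy = floor(x / eps), floor(y / eps)
        # The point is a "home" record in its own cell ...
        yield (cx, cy), ("home", pid, (x, y))
        # ... and a "replica" in each of the 8 neighbouring cells.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if (dx, dy) != (0, 0):
                    yield (cx + dx, cy + dy), ("replica", pid, (x, y))


def reduce_phase(records, eps):
    """Join the records of one cell and yield id pairs within distance eps."""
    home = [(pid, p) for role, pid, p in records if role == "home"]
    replicas = [(pid, p) for role, pid, p in records if role == "replica"]
    # Pairs whose points share this home cell.
    for (i, p), (j, q) in combinations(home, 2):
        if dist(p, q) <= eps:
            yield (min(i, j), max(i, j))
    # Cross-cell pairs appear in both home cells; keep only the copy
    # where the home point has the smaller id, so each pair is emitted once.
    for i, p in home:
        for j, q in replicas:
            if i < j and dist(p, q) <= eps:
                yield (i, j)


def similarity_join(points, eps):
    """Simulate the shuffle by grouping map output by cell, then reduce."""
    groups = defaultdict(list)
    for cell, record in map_phase(points, eps):
        groups[cell].append(record)
    pairs = set()
    for records in groups.values():
        pairs.update(reduce_phase(records, eps))
    return sorted(pairs)
```

In this sketch every point is replicated to all eight neighbouring cells, which keeps the reducers independent but multiplies the shuffled data by nine. The number of records per cell is also the per-reducer workload, so dense regions produce heavy reducers; that skew is the kind of imbalance a workload-aware partitioning scheme would need to address.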
