首页> 外文期刊>Data & Knowledge Engineering >Efficient processing of all-k-nearest-neighbor queries in the MapReduce programming framework
【24h】

Efficient processing of all-k-nearest-neighbor queries in the MapReduce programming framework

机译:在MapReduce编程框架中高效处理所有k个最近邻查询

获取原文
获取原文并翻译 | 示例
           

摘要

Numerous modern applications, from social networking to astronomy, need efficient answering of queries on spatial data. One such query is the All k Nearest-Neighbor Query, or k Nearest-Neighbor Join, that takes as input two datasets and, for each object of the first one, returns the k nearest-neighbors from the second one. It is a combination of the k nearest-neighbor and join queries and is computationally demanding. Especially, when the datasets involved fall in the category of Big Data, a single machine cannot efficiently process it. Only in the last few years, papers proposing solutions for distributed computing environments have appeared in the literature. In this paper, we focus on parallel and distributed algorithms using the Apache Hadoop framework. More specifically, we focus on an algorithm that was recently presented in the literature and propose improvements to tackle three major challenges that distributed processing faces: improvement of load balancing (we implement an adaptive partitioning scheme based on Quadtrees), acceleration of local processing (we prune points during calculations by utilizing plane-sweep processing), and reduction of network traffic (we restructure and reduce the output size of the most demanding phase of computation). Moreover, by using real 2D and 3D datasets, we experimentally study the effect of each improvement and their combinations on performance of this literature algorithm. Experiments show that by carefully addressing the three aforementioned issues, one can achieve significantly better performance. Thereby, we conclude to a new scalable algorithm that adapts to the data distribution and significantly outperforms its predecessor. Moreover, we present an experimental comparison of our algorithm against other well-known MapReduce algorithms for the same query and show that these algorithms are also significantly outperformed.
机译:从社交网络到天文学,许多现代应用程序都需要对空间数据查询的有效回答。这样的查询就是“全部k个最近邻居查询”或“ k个最近邻居联接”,它以两个数据集作为输入,对于第一个对象的每个对象,返回第二个数据集的k个最近邻居。它是k个最邻近查询和联接查询的组合,并且对计算的要求很高。特别是,当涉及的数据集属于大数据类别时,一台机器无法有效地对其进行处理。仅在最近几年中,有关分布式计算环境解决方案的论文才出现在文献中。在本文中,我们重点介绍使用Apache Hadoop框架的并行和分布式算法。更具体地说,我们专注于最近在文献中提出的算法,并提出改进措施以解决分布式处理面临的三个主要挑战:负载平衡的改进(我们实现了基于四叉树的自适应分区方案),本地处理的加速(我们通过利用平面扫描处理在计算过程中修剪点,并减少网络流量(我们调整结构并减少最苛刻计算阶段的输出大小)。此外,通过使用真实的2D和3D数据集,我们实验研究了每种改进及其组合对该文献算法性能的影响。实验表明,通过仔细解决上述三个问题,可以显着提高性能。因此,我们得出了一种新的可伸缩算法,该算法可适应数据分布并明显优于其前身。此外,对于相同的查询,我们将我们的算法与其他著名的MapReduce算法进行实验比较,并表明这些算法的性能也明显优于其他算法。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号