首页> 外文期刊>Data & Knowledge Engineering >Efficient processing of all-k-nearest-neighbor queries in the MapReduce programming framework
【24h】

Efficient processing of all-k-nearest-neighbor queries in the MapReduce programming framework

机译:MapReduce编程框架中的All-K-interBible邻查询的高效处理

获取原文
获取原文并翻译 | 示例
           

摘要

Numerous modern applications, from social networking to astronomy, need efficient answering of queries on spatial data. One such query is the All k Nearest-Neighbor Query, or k Nearest-Neighbor Join, that takes as input two datasets and, for each object of the first one, returns the k nearest-neighbors from the second one. It is a combination of the k nearest-neighbor and join queries and is computationally demanding. Especially, when the datasets involved fall in the category of Big Data, a single machine cannot efficiently process it. Only in the last few years, papers proposing solutions for distributed computing environments have appeared in the literature. In this paper, we focus on parallel and distributed algorithms using the Apache Hadoop framework. More specifically, we focus on an algorithm that was recently presented in the literature and propose improvements to tackle three major challenges that distributed processing faces: improvement of load balancing (we implement an adaptive partitioning scheme based on Quadtrees), acceleration of local processing (we prune points during calculations by utilizing plane-sweep processing), and reduction of network traffic (we restructure and reduce the output size of the most demanding phase of computation). Moreover, by using real 2D and 3D datasets, we experimentally study the effect of each improvement and their combinations on performance of this literature algorithm. Experiments show that by carefully addressing the three aforementioned issues, one can achieve significantly better performance. Thereby, we conclude to a new scalable algorithm that adapts to the data distribution and significantly outperforms its predecessor. Moreover, we present an experimental comparison of our algorithm against other well-known MapReduce algorithms for the same query and show that these algorithms are also significantly outperformed.
机译:从社交网络到天文学的许多现代应用需要在空间数据上有效地应答查询。一个这样的查询是所有K最近邻查询,或k最近邻居连接,它作为输入两个数据集,并且对于第一个数据集,并且对于第一个的每个对象,返回来自第二个的k最近邻居。它是k最近邻居的组合,并加入查询,并且是计算要求的。特别是,当数据集涉及到大数据类别时,单个机器无法有效地处理它。只在过去几年中,文献中出现了提出用于分布式计算环境的解决方案的论文。在本文中,我们专注于使用Apache Hadoop框架的并行和分布式算法。更具体地说,我们专注于最近在文献中呈现的算法,并提出改进来解决分布式处理面的三个主要挑战:负载平衡的改进(我们基于四仲群地实现了基于四分之一的自适应分区方案),加速本地处理(我们通过利用平面扫描处理来计算计算期间的修剪点),并减少网络流量(我们重组并减少最苛刻的计算阶段的输出大小)。此外,通过使用真实的2D和3D数据集,我们通过实验研究每个改进的效果及其组合对该文献算法性能的影响。实验表明,通过仔细解决三个上述问题,人们可以实现明显更好的性能。因此,我们结束了一种新的可扩展算法,适应数据分布并显着优于其前身。此外,我们向其他众所周知的MapReduce算法提供了对同一查询的实验比较,并表明这些算法也显着优于表现。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号