The 20th International Conference on Advanced Communications Technology

Capacity-aware key partitioning scheme for heterogeneous big data analytic engines


Abstract

Big data and cloud computing have been the centre of interest for the past decade. With growing data sizes and a wide variety of cloud applications, big data analytics has become very popular in both industry and academia, and research communities in both have continued to pursue fast, robust, and fault-tolerant analytic engines. MapReduce has become one of the most popular big data analytic engines over the past few years, and Hadoop is a standard implementation of the MapReduce framework for running data-intensive applications on clusters of commodity servers. By thoroughly studying the framework, we find that the shuffle phase, the all-to-all input data fetching phase of the reduce task, significantly affects application performance. In Hadoop's MapReduce system there is variance both in the frequencies of the intermediate keys and in their distribution among the data nodes of the cluster. This variance causes network overhead and leads to unfairness in the reduce input across data nodes, so applications suffer performance degradation in the shuffle phase. We develop a novel partitioning algorithm; unlike previous systems, it uses each node's capabilities as heuristics to decide a better trade-off between locality and fairness in the system. Compared with Hadoop's default partitioning algorithm and the Leen partitioning algorithm: (a) with 2 million key-value pairs to process, our approach achieves better resource utilization by about 19% and 9% on average, respectively; (b) with 3 million key-value pairs to process, our approach achieves near-optimal resource utilization, improving by about 15% and 7%, respectively.
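The abstract describes the idea, weighing each node's capability to balance data locality against fairness of the reduce input, but does not reproduce the partitioning rule itself. The sketch below is a minimal, standalone illustration (in Python, not a Hadoop Partitioner) of how such a capacity-aware key assignment could look; the input shapes, the alpha weight, and the scoring rule are assumptions made for illustration and are not taken from the paper.

# Hypothetical sketch of a capacity-aware key partitioner.
# All names, weights, and the scoring rule are illustrative assumptions;
# they are not the authors' algorithm.

from collections import defaultdict

def partition_keys(key_node_counts, node_capacities, alpha=0.5):
    """Greedily assign each intermediate key to one node (reducer host).

    key_node_counts: {key: {node: count}} -- where each key's records sit
                     after the map phase (locality information).
    node_capacities: {node: relative_capacity} -- heterogeneous node power.
    alpha: weight trading locality against capacity-aware fairness.
    Returns {key: node}.
    """
    total_records = sum(sum(c.values()) for c in key_node_counts.values())
    total_capacity = sum(node_capacities.values())

    assigned_load = defaultdict(float)  # reduce input assigned to each node
    assignment = {}

    # Place the most frequent keys first so the heavy ones are balanced early.
    keys_by_freq = sorted(key_node_counts,
                          key=lambda k: sum(key_node_counts[k].values()),
                          reverse=True)

    for key in keys_by_freq:
        freq = sum(key_node_counts[key].values())
        best_node, best_score = None, float("-inf")
        for node, capacity in node_capacities.items():
            # Locality: fraction of this key's records already on the node.
            locality = key_node_counts[key].get(node, 0) / freq
            # Fairness: projected load relative to the node's fair share,
            # where the fair share is proportional to its capacity.
            fair_share = total_records * capacity / total_capacity
            load_ratio = (assigned_load[node] + freq) / fair_share
            score = alpha * locality - (1 - alpha) * load_ratio
            if score > best_score:
                best_node, best_score = node, score
        assignment[key] = best_node
        assigned_load[best_node] += freq
    return assignment


if __name__ == "__main__":
    counts = {"k1": {"n1": 800, "n2": 200},
              "k2": {"n1": 100, "n2": 900},
              "k3": {"n1": 500, "n2": 500}}
    capacities = {"n1": 2.0, "n2": 1.0}  # n1 is twice as powerful as n2
    print(partition_keys(counts, capacities))

In this sketch the score rewards keeping a key where most of its records already are (less shuffle traffic) and penalizes nodes whose assigned load already exceeds their capacity-proportional fair share; the paper's actual scheme is evaluated against Hadoop's default hash partitioner and Leen as described above.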
