The 20th International Conference on Advanced Communications Technology

Capacity-aware key partitioning scheme for heterogeneous big data analytic engines


Abstract

Big data and cloud computing have been the centre of interest for the past decade. With growing data sizes and a wide variety of cloud applications, big data analytics has become very popular in both industry and academia, and research communities in both have continued to pursue fast, robust, and fault-tolerant analytic engines. MapReduce has become one of the most popular big data analytic engines over the past few years, and Hadoop is a standard implementation of the MapReduce framework for running data-intensive applications on clusters of commodity servers. By thoroughly studying the framework, we find that the shuffle phase, the all-to-all input data fetching phase of the reduce task, significantly affects application performance. In Hadoop's MapReduce system there is variance both in the frequencies of the intermediate keys and in their distribution among the data nodes of the cluster. This variance causes network overhead and leads to unfairness in the reduce input across data nodes, so applications suffer performance degradation in the shuffle phase. We develop a novel partitioning algorithm; unlike previous systems, it uses each node's capabilities as heuristics to decide a better trade-off between locality and fairness in the system. Compared with Hadoop's default partitioning algorithm and the Leen partitioning algorithm: (a) with 2 million key-value pairs to process, our approach achieves better resource utilization by about 19% and 9% on average, respectively; (b) with 3 million key-value pairs to process, our approach achieves near-optimal resource utilization, improving by about 15% and 7%, respectively.
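The abstract describes the idea, weighing each node's capability to balance data locality against fairness of the reduce input, but does not reproduce the partitioning rule itself. The sketch below is a minimal, standalone illustration (in Python, not a Hadoop Partitioner) of how such a capacity-aware key assignment could look; the input shapes, the alpha weight, and the scoring rule are assumptions made for illustration and are not taken from the paper.

# Hypothetical sketch of a capacity-aware key partitioner.
# All names, weights, and the scoring rule are illustrative assumptions;
# they are not the authors' algorithm.

from collections import defaultdict

def partition_keys(key_node_counts, node_capacities, alpha=0.5):
    """Greedily assign each intermediate key to one node (reducer host).

    key_node_counts: {key: {node: count}} -- where each key's records sit
                     after the map phase (locality information).
    node_capacities: {node: relative_capacity} -- heterogeneous node power.
    alpha: weight trading locality against capacity-aware fairness.
    Returns {key: node}.
    """
    total_records = sum(sum(c.values()) for c in key_node_counts.values())
    total_capacity = sum(node_capacities.values())

    assigned_load = defaultdict(float)  # reduce input assigned to each node
    assignment = {}

    # Place the most frequent keys first so the heavy ones are balanced early.
    keys_by_freq = sorted(key_node_counts,
                          key=lambda k: sum(key_node_counts[k].values()),
                          reverse=True)

    for key in keys_by_freq:
        freq = sum(key_node_counts[key].values())
        best_node, best_score = None, float("-inf")
        for node, capacity in node_capacities.items():
            # Locality: fraction of this key's records already on the node.
            locality = key_node_counts[key].get(node, 0) / freq
            # Fairness: projected load relative to the node's fair share,
            # where the fair share is proportional to its capacity.
            fair_share = total_records * capacity / total_capacity
            load_ratio = (assigned_load[node] + freq) / fair_share
            score = alpha * locality - (1 - alpha) * load_ratio
            if score > best_score:
                best_node, best_score = node, score
        assignment[key] = best_node
        assigned_load[best_node] += freq
    return assignment


if __name__ == "__main__":
    counts = {"k1": {"n1": 800, "n2": 200},
              "k2": {"n1": 100, "n2": 900},
              "k3": {"n1": 500, "n2": 500}}
    capacities = {"n1": 2.0, "n2": 1.0}  # n1 is twice as powerful as n2
    print(partition_keys(counts, capacities))

In this sketch the score rewards keeping a key where most of its records already are (less shuffle traffic) and penalizes nodes whose assigned load already exceeds their capacity-proportional fair share; the paper's actual scheme is evaluated against Hadoop's default hash partitioner and Leen as described above.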
