Performance comparison between Hadoop and Spark frameworks using HiBench benchmarks

Yassir Samadi; Mostapha Zbakh; Claude Tadonki

Abstract

Big Data has become one of the major areas of research for cloud service providers because of the large amount of data produced every day and the inefficiency of traditional algorithms and technologies in handling it. Big Data, with its characteristics of volume, variety, and veracity (3V), requires efficient technologies for real-time processing. Many powerful tools, such as Hadoop and Spark, exist to process and analyze this vast amount of data; both are mainly used in the context of Big Data and follow the principles of parallel computing. The challenge is to determine which Big Data tool is better for a given processing context. In this paper, we present and discuss a performance comparison between two popular Big Data frameworks deployed on virtual machines. Hadoop MapReduce and Apache Spark are used to efficiently process vast amounts of data in parallel and distributed mode on large clusters, and both are suited to Big Data processing. We also present the execution results of Apache Hadoop on Amazon EC2, a major cloud computing environment. To compare the performance of these two frameworks, we use the HiBench benchmark suite, which is an experimental approach for measuring the effectiveness of any computer system. The comparison is based on three criteria: execution time, throughput, and speedup. We test the Wordcount workload with different data sizes for more accurate results. Our experimental results show that the performance of these frameworks varies significantly depending on the use case implementation. Furthermore, we conclude from our results that Spark is more efficient than Hadoop at handling large amounts of data in most cases. However, Spark requires a higher memory allocation, since it loads the data to be processed into memory and keeps them in caches for a while, just like standard databases. The choice therefore depends on the required performance level and the memory constraints.
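The abstract names three comparison criteria (execution time, throughput, and speedup) without giving formulas. As a minimal sketch, assuming the common definitions of throughput as input bytes processed per second and speedup as the ratio of a baseline execution time to a candidate execution time, the criteria could be computed as follows. The `Run` type, the function names, and the numbers are hypothetical placeholders for illustration; they are not taken from the paper and are not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One benchmark run (hypothetical record type, not from the paper)."""
    framework: str
    input_bytes: int   # size of the workload input
    seconds: float     # measured execution time


def throughput(run: Run) -> float:
    """Input bytes processed per second (assumed definition)."""
    return run.input_bytes / run.seconds


def speedup(baseline: Run, candidate: Run) -> float:
    """How many times faster the candidate ran than the baseline (assumed definition)."""
    return baseline.seconds / candidate.seconds


# Placeholder values chosen only to exercise the functions; not real measurements.
hadoop_run = Run("hadoop", input_bytes=1_000_000_000, seconds=200.0)
spark_run = Run("spark", input_bytes=1_000_000_000, seconds=80.0)

print(f"{hadoop_run.framework}: {throughput(hadoop_run):.0f} bytes/s")
print(f"{spark_run.framework}: {throughput(spark_run):.0f} bytes/s")
print(f"speedup of spark over hadoop: {speedup(hadoop_run, spark_run):.2f}x")
```

With both runs using the same input size, the speedup equals the ratio of the two throughputs, which is why the criteria tend to agree when only one data size is compared; testing several data sizes, as the abstract describes, shows how the ratio changes with scale.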


Bibliographic details

Source
Concurrency, practice and experience | 2018, Issue 12 | e4367.1-e4367.13 | 13 pages
Authors
Yassir Samadi; Mostapha Zbakh; Claude Tadonki
Author affiliations

National School of Computer Science and Systems Analysis, Mohammed V University, Rabat, Morocco;

National School of Computer Science and Systems Analysis, Mohammed V University, Rabat, Morocco;

MINES ParisTech / CRI, Paris, France;

Original format: PDF
Language: English
Keywords
Amazon EC2; Big Data; cloud computing; Hadoop; HiBench; parallel and distributed processing; Spark;


