Journal: Concurrency, practice and experience

Performance comparison between Hadoop and Spark frameworks using HiBench benchmarks

Abstract

Big Data has become one of the major areas of research for cloud service providers because of the large amount of data produced every day and the inefficiency of traditional algorithms and technologies in handling it. Big Data, with characteristics such as volume, variety, and veracity (3V), requires efficient technologies to process in real time. Many powerful tools, such as Hadoop and Spark, are used to process and analyze this vast amount of data; both follow the principles of parallel computing. The challenge is to determine which Big Data tool is better for a given processing context. In this paper, we present and discuss a performance comparison between two popular Big Data frameworks deployed on virtual machines. Hadoop MapReduce and Apache Spark both process vast amounts of data efficiently in parallel and distributed mode on large clusters, and both are suited to Big Data processing. We also present the execution results of Apache Hadoop on Amazon EC2, a major cloud computing environment. To compare the performance of the two frameworks, we use the HiBench benchmark suite, which provides an experimental approach for measuring the effectiveness of a computer system. The comparison is based on three criteria: execution time, throughput, and speedup. We test the WordCount workload with different data sizes for more accurate results. Our experimental results show that the performance of these frameworks varies significantly with the use case implementation. Furthermore, we conclude from our results that Spark is more efficient than Hadoop at handling large amounts of data in most cases. However, Spark requires a higher memory allocation, since it loads the data to be processed into memory and keeps it cached for a while, just like standard databases. The choice therefore depends on the required performance level and the memory constraints.
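The abstract names the WordCount workload and three comparison criteria but does not define them. The sketch below is one plausible reading, not the authors' method: it assumes throughput means input bytes divided by execution time, and speedup means the ratio of a baseline run's time to a candidate run's time. The function names and all numbers are hypothetical and are not results from the paper. The word count is written in plain Python as a stand-in for the benchmark workload, not as Hadoop or Spark code.

```python
from collections import Counter


def word_count(lines):
    """Count word occurrences: split each line (map), then sum per word (reduce)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)


def throughput(input_bytes, seconds):
    """Assumed definition: bytes processed per second of execution time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return input_bytes / seconds


def speedup(baseline_seconds, candidate_seconds):
    """Assumed definition: how many times faster the candidate is than the baseline."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("durations must be positive")
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Hypothetical values for illustration only; not measurements from the paper.
    input_bytes = 1_000_000_000
    hadoop_seconds = 200.0
    spark_seconds = 80.0

    print(word_count(["big data big", "data"]))
    print(throughput(input_bytes, spark_seconds))
    print(speedup(hadoop_seconds, spark_seconds))
```

Under these assumed definitions, a speedup above 1 means the candidate run finished faster than the baseline; repeating the calculation for each input size gives a view of how the gap changes as the data grows.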
