TPC Technology Conference on Performance Evaluation and Benchmarking

Characterizing BigBench Queries, Hive, and Spark in Multi-cloud Environments

Abstract

BigBench is the new standard (TPCx-BB) for benchmarking and testing Big Data systems. The TPCx-BB specification describes several business use cases - queries - which require a broad combination of data extraction techniques including SQL, Map/Reduce (M/R), user code (UDF), and Machine Learning to fulfill them. However, there is currently no widespread knowledge of the resource requirements and expected performance of each query, as is the case for more established benchmarks. Moreover, over the last year, the Spark framework and APIs have been evolving very rapidly, with major improvements in performance and the stable release of v2. Our intent is to compare the current state of Spark to Hive's base implementation, which can use the legacy M/R engine and Mahout or the current Tez and MLlib frameworks. At the same time, cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Hive and Spark come ready to use, with a general-purpose configuration and upgrade management. The study characterizes both the BigBench queries and the out-of-the-box performance of Spark and Hive versions in the cloud, while also comparing popular PaaS offerings from Azure HDInsight, Amazon Web Services EMR, and Google Cloud Dataproc in terms of reliability, data scalability (1 GB to 10 TB), versions, and settings. The query characterization highlights the similarities and differences between the Hive and Spark frameworks, and identifies which queries are the most resource-consuming in terms of CPU, memory, and I/O. Scalability results show that configuration tuning is needed in most cloud providers as data scale grows, especially for Spark's memory usage. These results can help practitioners quickly test systems by picking a subset of the queries that stresses each of the categories. At the same time, the results show how Hive and Spark compare and what performance can be expected of each in PaaS.
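To make the tuning point above concrete, the following is a minimal PySpark sketch, not taken from the paper: the configuration values, table, and UDF name are hypothetical, and it only illustrates the kind of memory and shuffle settings that typically need adjustment as scale factors grow, together with the SQL-plus-UDF mix that TPCx-BB-style queries exercise.

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

# Hypothetical settings of the kind that general-purpose PaaS defaults may not
# cover at larger BigBench scale factors; all values are illustrative only.
spark = (
    SparkSession.builder
    .appName("bigbench-style-sketch")
    .config("spark.executor.memory", "8g")            # executor heap size (assumed value)
    .config("spark.executor.memoryOverhead", "2g")    # off-heap overhead (assumed value)
    .config("spark.sql.shuffle.partitions", "400")    # shuffle parallelism (assumed value)
    .getOrCreate()
)

# Toy table standing in for BigBench data.
spark.createDataFrame(
    [(1, "electronics"), (2, "books"), (3, "electronics")],
    ["item_sk", "category"],
).createOrReplaceTempView("item")

# User code (UDF) combined with SQL, echoing the mix of techniques the
# TPCx-BB queries require.
spark.udf.register("normalize", lambda c: c.upper(), StringType())

spark.sql("""
    SELECT normalize(category) AS category, COUNT(*) AS items
    FROM item
    GROUP BY category
""").show()

spark.stop()

In practice, settings like these would be passed through spark-submit or the provider's cluster configuration rather than hard-coded; the sketch simply mirrors the abstract's observation that managed clusters ship with general-purpose defaults that may need adjustment at scale.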
