TPC Technology Conference on Performance Evaluation and Benchmarking

Characterizing BigBench Queries, Hive, and Spark in Multi-cloud Environments

Abstract

BigBench is the new standard (TPCx-BB) for benchmarking and testing Big Data systems. The TPCx-BB specification describes several business use cases - queries - which require a broad combination of data extraction techniques including SQL, Map/Reduce (M/R), user code (UDF), and Machine Learning to fulfill them. However, there is currently no widespread knowledge of the resource requirements and expected performance of each query, as is the case for more established benchmarks. Moreover, over the last year, the Spark framework and APIs have been evolving very rapidly, with major improvements in performance and the stable release of v2. Our intent is to compare the current state of Spark to Hive's base implementation, which can use the legacy M/R engine and Mahout or the current Tez and MLlib frameworks. At the same time, cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Hive and Spark come ready to use, with a general-purpose configuration and upgrade management. The study characterizes both the BigBench queries and the out-of-the-box performance of Spark and Hive versions in the cloud, while also comparing popular PaaS offerings from Azure HDInsight, Amazon Web Services EMR, and Google Cloud Dataproc in terms of reliability, data scalability (1 GB to 10 TB), versions, and settings. The query characterization highlights the similarities and differences between the Hive and Spark frameworks, and identifies which queries are the most resource-consuming in terms of CPU, memory, and I/O. Scalability results show that configuration tuning is needed in most cloud providers as data scale grows, especially for Spark's memory usage. These results can help practitioners quickly test systems by picking a subset of the queries that stresses each of the categories. At the same time, the results show how Hive and Spark compare and what performance can be expected of each in PaaS.
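To make the tuning point above concrete, the following is a minimal PySpark sketch, not taken from the paper: the configuration values, table, and UDF name are hypothetical, and it only illustrates the kind of memory and shuffle settings that typically need adjustment as scale factors grow, together with the SQL-plus-UDF mix that TPCx-BB-style queries exercise.

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

# Hypothetical settings of the kind that general-purpose PaaS defaults may not
# cover at larger BigBench scale factors; all values are illustrative only.
spark = (
    SparkSession.builder
    .appName("bigbench-style-sketch")
    .config("spark.executor.memory", "8g")            # executor heap size (assumed value)
    .config("spark.executor.memoryOverhead", "2g")    # off-heap overhead (assumed value)
    .config("spark.sql.shuffle.partitions", "400")    # shuffle parallelism (assumed value)
    .getOrCreate()
)

# Toy table standing in for BigBench data.
spark.createDataFrame(
    [(1, "electronics"), (2, "books"), (3, "electronics")],
    ["item_sk", "category"],
).createOrReplaceTempView("item")

# User code (UDF) combined with SQL, echoing the mix of techniques the
# TPCx-BB queries require.
spark.udf.register("normalize", lambda c: c.upper(), StringType())

spark.sql("""
    SELECT normalize(category) AS category, COUNT(*) AS items
    FROM item
    GROUP BY category
""").show()

spark.stop()

In practice, settings like these would be passed through spark-submit or the provider's cluster configuration rather than hard-coded; the sketch simply mirrors the abstract's observation that managed clusters ship with general-purpose defaults that may need adjustment at scale.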
