Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems

Eduarda Costa; Carlos Costa; Maribel Yasmina Santos

Abstract

Hive has long been one of the industry-leading systems for Data Warehousing in Big Data contexts. It organizes data mainly into databases, tables, partitions and buckets, stored on top of an unstructured distributed file system such as HDFS. Several studies have examined how to optimize the performance of storage systems for Big Data Warehousing. However, few of them explore the impact of data organization strategies on query performance when Hive is the storage technology used to implement Big Data Warehousing systems. This paper therefore evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and measuring their effect on query performance. The results demonstrate the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of adequate partitioning strategies. Defining partitions aligned with the attributes frequently used in query conditions/filters can significantly improve the system's response time. In the more intensive workload benchmarked in this paper, overall decreases of about 40% in processing time were observed. The same does not hold for bucketing strategies, which show potential benefits only in very specific scenarios. This suggests a more restricted use of bucketing, namely when two tables are bucketed by their join attribute.
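The two mechanisms named in the abstract can be illustrated with a small, self-contained Python sketch. It is not the paper's benchmark and does not use Hive: the rows, the column layout, and the helper names (`partition_by`, `scan_partitioned`, `bucket_of`) are hypothetical, and the modulo bucketing is a simplification rather than Hive's actual hash function. The sketch only shows why a filter on the partition attribute reads fewer rows, and why two tables bucketed by the same join attribute place matching keys in the same bucket number.

```python
from collections import defaultdict

# Hypothetical fact rows: (order_id, order_year, customer_id, amount)
ROWS = [
    (1, 2017, 10, 25.0),
    (2, 2017, 11, 40.0),
    (3, 2018, 10, 15.0),
    (4, 2018, 12, 60.0),
    (5, 2019, 11, 35.0),
    (6, 2019, 12, 20.0),
]


def partition_by(rows, key_index):
    """Group rows by a partition attribute (conceptually, one directory per value)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key_index]].append(row)
    return dict(partitions)


def scan_unpartitioned(rows, year):
    """Full scan: every row is read in order to evaluate the filter."""
    rows_read = len(rows)
    total = sum(r[3] for r in rows if r[1] == year)
    return total, rows_read


def scan_partitioned(partitions, year):
    """Partition pruning: only the partition matching the filter is read."""
    matching = partitions.get(year, [])
    return sum(r[3] for r in matching), len(matching)


def bucket_of(join_key, num_buckets):
    """Assign an integer join key to a bucket (simplified modulo, an assumption)."""
    return join_key % num_buckets


by_year = partition_by(ROWS, key_index=1)
full_total, full_read = scan_unpartitioned(ROWS, 2018)
pruned_total, pruned_read = scan_partitioned(by_year, 2018)

# Same answer, but the partitioned scan touches fewer rows.
print("full scan:   total =", full_total, "rows read =", full_read)
print("pruned scan: total =", pruned_total, "rows read =", pruned_read)

# Two tables bucketed by the same join attribute and bucket count send
# customer_id 10 to the same bucket number, so a join can proceed bucket by bucket.
orders_bucket = bucket_of(10, num_buckets=4)
customers_bucket = bucket_of(10, num_buckets=4)
print("same bucket for customer_id 10:", orders_bucket == customers_bucket)
```

In this toy data the filter on `order_year` reads 2 rows instead of 6. The benefit depends on the filter matching the partition attribute, which is consistent with the abstract's point that partitions should follow the attributes used in query conditions.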


Bibliographic details

Source: Journal of Big Data, 2019, Issue 1, 38 pages
Authors: Eduarda Costa; Carlos Costa; Maribel Yasmina Santos
Original format: PDF
Chinese Library Classification: computing technology, computer technology
Keywords: Big Data; Big Data Warehouse; Hive; Partitions; Buckets

