Journal of Big Data

Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems


Abstract

Hive has long been one of the industry-leading systems for Data Warehousing in Big Data contexts. It mainly organizes data into databases, tables, partitions and buckets, stored on top of an unstructured distributed file system like HDFS. Several studies have examined how to optimize the performance of storage systems for Big Data Warehousing. However, few of them explore the impact of data organization strategies on query performance when Hive is the storage technology used to implement Big Data Warehousing systems. This paper therefore evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and measuring their effect on query performance. The results demonstrate the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of adequate partitioning strategies. Aligning partitions with the attributes frequently used in query conditions/filters can significantly improve response time: in the more intensive workload benchmarked in this paper, overall decreases of about 40% in processing time were observed. The same was not observed for bucketing strategies, which show potential benefits only in very specific scenarios. This suggests a more restricted use of bucketing, namely bucketing two tables by their join attribute.
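The two data organization strategies discussed in the abstract can be illustrated with a small in-memory sketch. This is not the paper's benchmark and does not call Hive. The tables, column positions, and function names are hypothetical, and a plain modulo on an integer key stands in for whatever hash function the engine applies. The sketch only shows why a filter on the partition attribute reads fewer rows, and why bucketing two tables by the join attribute lets bucket i of one table be joined with bucket i of the other alone.

```python
from collections import defaultdict

# Hypothetical toy tables (assumptions, not data from the paper).
# ORDERS columns: (order_id, customer_id, order_year, amount)
ORDERS = [
    (1, 10, 1994, 50.0),
    (2, 11, 1995, 20.0),
    (3, 10, 1995, 30.0),
    (4, 12, 1996, 70.0),
]
# CUSTOMERS columns: (customer_id, name)
CUSTOMERS = [(10, "A"), (11, "B"), (12, "C")]


def partition_by(rows, key_index):
    """Group rows by the partition attribute (one group per distinct value)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key_index]].append(row)
    return dict(partitions)


def scan_with_pruning(partitions, wanted_value):
    """Read only the partition that matches the filter value.

    Returns the matching rows and the number of rows that had to be read.
    """
    rows = partitions.get(wanted_value, [])
    return rows, len(rows)


def bucketize(rows, key_index, num_buckets):
    """Distribute rows into buckets by key modulo num_buckets (assumed hash)."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row[key_index] % num_buckets].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets, left_key, right_key):
    """Join bucket i of the left table only with bucket i of the right table."""
    joined = []
    for left_rows, right_rows in zip(left_buckets, right_buckets):
        index = defaultdict(list)
        for r in right_rows:
            index[r[right_key]].append(r)
        for l in left_rows:
            for r in index.get(l[left_key], []):
                joined.append(l + r)
    return joined


if __name__ == "__main__":
    # Partitioning by order_year: a filter on 1995 touches one partition only.
    by_year = partition_by(ORDERS, key_index=2)
    rows_1995, rows_read = scan_with_pruning(by_year, 1995)
    print(f"rows read with pruning: {rows_read} of {len(ORDERS)}")

    # Bucketing both tables by customer_id with the same bucket count.
    order_buckets = bucketize(ORDERS, key_index=1, num_buckets=2)
    customer_buckets = bucketize(CUSTOMERS, key_index=0, num_buckets=2)
    result = bucketed_join(order_buckets, customer_buckets, 1, 0)
    print(f"joined rows: {len(result)}")
```

Both tables must use the same bucket count and the same key for the bucket-wise join to be correct, which matches the narrow scenario the abstract singles out for bucketing.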
