Conference: IEEE International Conference on Data Engineering

DualTable: A hybrid storage model for update optimization in Hive

Abstract

Hive is the most mature and prevalent data warehouse tool providing a SQL-like interface in the Hadoop ecosystem. It is used successfully at many Internet companies and has shown its value for big data processing in traditional industries. However, enterprise big data processing systems, such as those in Smart Grid applications, usually require complex business logic and involve many data manipulation operations such as updates and deletes. Hive cannot support these operations adequately while preserving high query performance. Hive on the Hadoop Distributed File System (HDFS) cannot perform data manipulation efficiently, while Hive on HBase supports faster data manipulation but suffers from poor query performance. A project based on Hive issue Hive-5317 aims to support update operations, but it has not been completed in Hive's latest version. Because this ACID-compliant extension adopts the same data storage format on HDFS, it does not solve the update performance problem. In this paper, we propose a hybrid storage model called DualTable, which combines the efficient streaming reads of HDFS with the random write capability of HBase. Hive on DualTable provides better data manipulation support while preserving query performance. Experiments on a TPC-H data set and on a real smart grid data set show that Hive on DualTable is up to 10 times faster than Hive when executing update and delete operations.
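The abstract does not spell out how DualTable combines the two stores, so the following is only a minimal in-memory sketch of one plausible reading: an immutable base that is scanned sequentially (standing in for files on HDFS) plus a small keyed side store that absorbs updates and deletes (standing in for HBase), merged at read time. The class name `DualTableSketch`, its methods, the row layout, and the `compact` step are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Row = Tuple[int, str]  # (row id, value); a hypothetical row layout


class DualTableSketch:
    """Toy model: a list plays the HDFS-resident data, a dict plays HBase."""

    def __init__(self, base_rows: List[Row]) -> None:
        # Base data: read by sequential scan, never modified in place.
        self._master: List[Row] = list(base_rows)
        # Side store: row id -> new value, or None to mark a deletion.
        self._delta: Dict[int, Optional[str]] = {}

    def update(self, row_id: int, value: str) -> None:
        # A keyed write to the side store; the base data is untouched.
        self._delta[row_id] = value

    def delete(self, row_id: int) -> None:
        # Record a deletion marker instead of rewriting the base data.
        self._delta[row_id] = None

    def scan(self) -> Iterator[Row]:
        # Stream the base rows and overlay any recorded changes.
        for row_id, value in self._master:
            if row_id not in self._delta:
                yield (row_id, value)
                continue
            new_value = self._delta[row_id]
            if new_value is not None:  # None means the row was deleted
                yield (row_id, new_value)

    def compact(self) -> None:
        # Assumed maintenance step: fold the changes back into the base.
        self._master = list(self.scan())
        self._delta.clear()


table = DualTableSketch([(1, "a"), (2, "b"), (3, "c")])
table.update(2, "B")
table.delete(3)
print(list(table.scan()))
```

In this sketch an update or delete costs one keyed write regardless of how large the base data is, while every scan pays a lookup per row to merge the changes. Inserts of new rows are not modeled.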