International Conference on Big Data Analytics and Knowledge Discovery

S2D: Shared Distributed Datasets, Storing Shared Data for Multiple and Massive Queries Optimization in a Distributed Data Warehouse

Abstract

Nowadays, with the constantly increasing amount of data, we face a growing number of users characterized by frequent and massively concurrent data access. This large number of users poses multiple-query optimization problems. In a distributed data warehousing system such as Hadoop/Hive, queries are evaluated one at a time and processed with the MapReduce paradigm. Massive query execution usually overloads and slows down the entire distributed environment, mainly because of multiple data scan tasks. In this paper, we aim to optimize multiple-query execution performance on Hive. We propose Shared Distributed Datasets (S2D), a method that dynamically looks for and shares common data among queries. The evaluation shows that, compared to Hive, S2D consumes on average 20% less memory in the Map-scan task and is 12% faster in execution time on the interactive and reporting queries from TPC-DS.
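The abstract does not describe how S2D is implemented, so the following is only a toy illustration of the general idea it names: when several queries read the same data, scan that data once and share it among them. Everything here (the function `run_shared_scans`, the in-memory `tables` dictionary, the sample rows and predicates) is hypothetical and is not taken from the paper or from Hive.

```python
from collections import defaultdict


def run_shared_scans(queries, tables):
    """Evaluate filter queries with one scan per table (illustrative only).

    queries: list of (query_name, table_name, predicate) tuples.
    tables:  dict mapping table_name to a list of row dicts.
    Returns (results, scans), where results maps query_name to its
    matching rows and scans counts how many table passes were made.
    """
    # Group queries by the table they read, so common data is found up front.
    by_table = defaultdict(list)
    for name, table, predicate in queries:
        by_table[table].append((name, predicate))

    results = {name: [] for name, _, _ in queries}
    scans = 0
    for table, consumers in by_table.items():
        scans += 1  # a single pass, however many queries read this table
        for row in tables[table]:
            # Each scanned row is offered to every query that needs the table.
            for name, predicate in consumers:
                if predicate(row):
                    results[name].append(row)
    return results, scans


# Hypothetical toy data: two queries over the same table.
tables = {
    "store_sales": [
        {"item": 1, "qty": 5},
        {"item": 2, "qty": 12},
        {"item": 3, "qty": 30},
    ]
}
queries = [
    ("q1", "store_sales", lambda r: r["qty"] > 10),
    ("q2", "store_sales", lambda r: r["item"] == 1),
]
results, scans = run_shared_scans(queries, tables)
```

Evaluating the two queries one at a time would read `store_sales` twice; in this sketch the table is read once and both queries consume the same pass. A real system would also have to decide when sharing is worthwhile and how long shared data stays available, which this sketch does not attempt.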
