Journal: IEEE Transactions on Knowledge and Data Engineering

Parallel Star Join+DataIndexes: efficient query processing in data warehouses and OLAP


Abstract

On-line analytical processing (OLAP) refers to the technologies that allow users to efficiently retrieve data from the data warehouse for decision-support purposes. Data warehouses tend to be extremely large: it is quite possible for a data warehouse to be hundreds of gigabytes to terabytes in size (Chaudhuri and Dayal, 1997). Queries tend to be complex and ad hoc, often requiring computationally expensive operations such as joins and aggregation. Given this, we are interested in developing strategies for improving query processing in data warehouses by exploring the applicability of parallel processing techniques. In particular, we exploit the natural partitionability of a star schema and render it even more efficient by applying DataIndexes, a storage structure that serves both as an index and as data and lends itself naturally to vertical partitioning of the data. DataIndexes are derived from the various special-purpose access mechanisms currently supported in commercial OLAP products. Specifically, we propose a declustering strategy that incorporates both task and data partitioning, and we present the Parallel Star Join (PSJ) Algorithm, which provides a means to perform a star join in parallel using efficient operations involving only rowsets and projection columns. We compare the performance of the PSJ Algorithm with two parallel query processing strategies. The first is a parallel join strategy utilizing the Bitmap Join Index (BJI), arguably the state-of-the-art OLAP join structure in use today. For the second strategy, we choose a well-known parallel join algorithm, namely the pipelined hash algorithm. To assist in the performance comparison, we first develop a cost model of the disk access and transmission costs for all three approaches.
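The abstract does not spell out the PSJ Algorithm itself, so the following is only a minimal single-process sketch of the general idea it names: a star join carried out with rowsets (sets of fact-row positions) and projection columns over vertically partitioned data. The table layout, the column names, and the helpers `rowset` and `star_join_sum` are all hypothetical and are not taken from the paper.

```python
# Toy star schema stored column-wise (vertical partitioning):
# each column is its own list. All names and values are hypothetical.
fact = {
    "store_key":   [0, 1, 0, 2, 1, 2],
    "product_key": [0, 0, 1, 1, 2, 2],
    "sales":       [5, 20, 30, 40, 50, 60],
}
store_region = ["east", "west", "east"]      # dimension column, indexed by store_key
product_category = ["toys", "food", "toys"]  # dimension column, indexed by product_key


def rowset(fk_column, dim_column, wanted):
    """Return the fact-row positions whose dimension value equals `wanted`."""
    return {row for row, key in enumerate(fk_column) if dim_column[key] == wanted}


def star_join_sum(region, category):
    """Intersect the per-dimension rowsets, then read only the projected column."""
    rows = rowset(fact["store_key"], store_region, region) & rowset(
        fact["product_key"], product_category, category
    )
    return sum(fact["sales"][row] for row in sorted(rows))
```

In this sketch each rowset depends on only one foreign-key column and one dimension column, so the two rowsets could in principle be built independently before being intersected. How the paper actually assigns such work to processors is not described in the abstract.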