Journal: IEEE Transactions on Parallel and Distributed Systems

A parallel hash join algorithm for managing data skew

Abstract

This paper presents a parallel hash join algorithm, based on the concept of hierarchical hashing, that addresses the problem of data skew. The proposed algorithm splits the usual hash phase into a hash phase and an explicit transfer phase, and adds a scheduling phase between the two. During the scheduling phase, a heuristic optimization algorithm uses the output of the hash phase to attempt to balance the load across the processors in the subsequent join phase. The algorithm naturally identifies the hash partitions with the largest skew values and splits them as necessary, assigning each of them to an optimal number of processors. Assuming, for concreteness, a Zipf-like distribution of the values in the join column, a CPU-bound join phase, and a shared-nothing environment, the algorithm is shown to achieve good join-phase load balancing and to be robust with respect to both the degree of data skew and the total number of processors. The overall speedup due to this algorithm is compared with that of some existing parallel hash join methods; the proposed method performs considerably better in high-skew situations.
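
To make the role of the scheduling phase concrete, the sketch below shows one way partition sizes reported by a hash phase could be turned into a balanced processor assignment. It is an illustration only, not the heuristic from the paper: it substitutes a plain greedy longest-processing-time rule, uses tuple count as the sole proxy for CPU-bound join cost, and ignores how the matching tuples of the other relation would have to be routed when a partition is split. The names `zipf_partition_sizes`, `schedule`, and the parameter `theta` are hypothetical.

```python
import heapq
from math import ceil


def zipf_partition_sizes(num_partitions, total_tuples, theta=1.0):
    """Hypothetical helper: Zipf-like partition sizes, largest first."""
    weights = [1.0 / (rank ** theta) for rank in range(1, num_partitions + 1)]
    norm = sum(weights)
    return [round(total_tuples * w / norm) for w in weights]


def schedule(partition_sizes, num_processors):
    """Greedy stand-in for a scheduling phase (assumption, not the paper's method).

    Returns (loads, assignment), where loads[p] is the tuple count given to
    processor p and assignment is a list of (partition, processor, tuples).
    """
    total = sum(partition_sizes)
    # Ideal per-processor load; at least 1 to avoid dividing by zero.
    target = max(1, ceil(total / num_processors))

    # Split every partition larger than the target into near-equal pieces.
    pieces = []
    for pid, size in enumerate(partition_sizes):
        k = max(1, ceil(size / target))
        base, extra = divmod(size, k)
        for i in range(k):
            pieces.append((base + (1 if i < extra else 0), pid))

    # Largest piece first, always onto the currently least-loaded processor.
    pieces.sort(reverse=True)
    heap = [(0, p) for p in range(num_processors)]
    heapq.heapify(heap)
    loads = [0] * num_processors
    assignment = []
    for size, pid in pieces:
        load, proc = heapq.heappop(heap)
        loads[proc] = load + size
        assignment.append((pid, proc, size))
        heapq.heappush(heap, (loads[proc], proc))
    return loads, assignment


if __name__ == "__main__":
    sizes = zipf_partition_sizes(num_partitions=16, total_tuples=100_000)
    loads, assignment = schedule(sizes, num_processors=8)
    mean = sum(loads) / len(loads)
    print("largest partition:", max(sizes))
    print("max load / mean load:", max(loads) / mean)
```

Without the splitting step, the processor that receives the largest partition would carry that whole partition regardless of how the rest are placed, which is the imbalance the abstract attributes to skew. Splitting it across several processors before placement lets the greedy rule bring the maximum load close to the mean in this toy setting.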