A parallel hash join algorithm for managing data skew

Wolf J.L.; Yu P.S.

Abstract

This paper presents a parallel hash join algorithm, based on the concept of hierarchical hashing, that addresses the problem of data skew. The proposed algorithm splits the usual hash phase into a hash phase and an explicit transfer phase, and adds an extra scheduling phase between the two. During the scheduling phase, a heuristic optimization algorithm uses the output of the hash phase to try to balance the load across the processors in the subsequent join phase. The algorithm naturally identifies the hash partitions with the largest skew values and splits them as necessary, assigning each of them to an optimal number of processors. Assuming for concreteness a Zipf-like distribution of the values in the join column, a CPU-bound join phase, and a shared-nothing environment, the algorithm is shown to achieve good join phase load balancing and to be robust with respect to the degree of data skew and the total number of processors. The overall speedup due to this algorithm is compared with that of some existing parallel hash join methods; the proposed method does considerably better in high-skew situations.
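The scheduling idea summarized in the abstract can be illustrated with a small sketch. This is not the authors' heuristic, which the abstract does not specify. It swaps in a simple stand-in: partitions larger than the ideal per-processor load are cut into equal pieces, and the pieces are then placed with a greedy longest-processing-time rule. The names `zipf_like_sizes` and `schedule`, the parameter `theta`, and the assumption that the join work of a split partition divides evenly among its pieces are all hypothetical and chosen only for illustration.

```python
import heapq
import math


def zipf_like_sizes(num_partitions, total_tuples, theta):
    """Partition sizes proportional to 1 / rank**theta (theta = 0 is uniform)."""
    weights = [1.0 / (rank ** theta) for rank in range(1, num_partitions + 1)]
    scale = total_tuples / sum(weights)
    return [w * scale for w in weights]


def schedule(partition_sizes, num_processors):
    """Assign hash partitions to processors for the join phase.

    Returns (assignments, loads), where assignments[p] is a list of
    (partition_id, piece_size) pairs given to processor p.
    Assumption: a partition split into k pieces costs size / k per piece.
    """
    target = sum(partition_sizes) / num_processors  # ideal per-processor load
    pieces = []
    for pid, size in enumerate(partition_sizes):
        # Split any partition that exceeds the ideal load into k equal pieces.
        k = max(1, math.ceil(size / target)) if target > 0 else 1
        pieces.extend((size / k, pid) for _ in range(k))

    # Greedy placement: largest piece first, onto the least-loaded processor.
    pieces.sort(reverse=True)
    heap = [(0.0, p) for p in range(num_processors)]
    assignments = [[] for _ in range(num_processors)]
    for size, pid in pieces:
        load, p = heapq.heappop(heap)
        assignments[p].append((pid, size))
        heapq.heappush(heap, (load + size, p))

    loads = [sum(size for _, size in a) for a in assignments]
    return assignments, loads


if __name__ == "__main__":
    sizes = zipf_like_sizes(num_partitions=64, total_tuples=1_000_000, theta=1.0)
    _, loads = schedule(sizes, num_processors=8)
    print("max load / ideal load:", max(loads) / (sum(sizes) / 8))
```

Because no piece exceeds the ideal load after splitting, the greedy rule keeps every processor below twice the ideal load in this sketch. Without splitting, the single largest partition would set a floor on the join phase time regardless of how the rest is placed.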

Bibliographic details

Source: IEEE Transactions on Parallel and Distributed Systems, 1993, no. 12, pp. 1355-1371 (17 pages)
Authors: Wolf J.L.; Yu P.S.
Original format: PDF
Language: English
Chinese Library Classification: Computing technology, computer technology

