IEEE International Conference on Data Engineering

Flow-Join: Adaptive skew handling for distributed joins over high-speed networks

Abstract

Modern InfiniBand interconnects offer link speeds of several gigabytes per second and a remote direct memory access (RDMA) paradigm for zero-copy network communication. Both are crucial for parallel database systems to achieve scalable distributed query processing where adding a server to the cluster increases performance. However, the scalability of distributed joins is threatened by unexpected data characteristics: Skew can cause a severe load imbalance such that a single server has to process a much larger part of the input than its fair share and by this slows down the entire distributed query. We introduce Flow-Join, a novel distributed join algorithm that handles attribute value skew with minimal overhead. Flow-Join detects heavy hitters at runtime using small approximate histograms and adapts the redistribution scheme to resolve load imbalances before they impact the join performance. Previous approaches often involve expensive analysis phases, which slow down distributed join processing for non-skewed workloads. This is especially the case for modern high-speed interconnects, which are too fast to hide the extra computation. Other skew handling approaches require detailed statistics, which are often not available or overly inaccurate for intermediate results. In contrast, Flow-Join uses our novel lightweight skew handling scheme to execute at the full network speed of more than 6 GB/s for InfiniBand 4×FDR, joining a skewed input at 11.5 billion tuples/s with 32 servers. This is 6.8× faster than a standard distributed hash join using the same hardware. At the same time, Flow-Join does not compromise the join performance for non-skewed workloads.
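
The runtime detection described in the abstract can be illustrated with a small sketch. The abstract does not name the histogram structure or the exact redistribution rule, so everything below is an assumption made for illustration: a Space-Saving style counter stands in for the "small approximate histogram", join keys are assumed to be integers partitioned by `key % num_servers`, and a detected heavy hitter on the probe side is kept on the local server instead of being sent to its hash partition. In a real join this only works if the matching build-side tuples are made available on every server, which the sketch does not model. The names `SpaceSaving`, `route_probe_tuple`, `capacity`, and `threshold` are hypothetical.

```python
class SpaceSaving:
    """Fixed-size approximate frequency counter (Space-Saving style).

    Assumption: used here as a stand-in for a small approximate histogram.
    Counts of tracked keys never underestimate the true frequency.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def add(self, key):
        if key in self.counts:
            self.counts[key] += 1
        elif len(self.counts) < self.capacity:
            self.counts[key] = 1
        else:
            # Evict the key with the smallest count; the newcomer
            # inherits that count plus one.
            victim = min(self.counts, key=self.counts.get)
            self.counts[key] = self.counts.pop(victim) + 1

    def estimate(self, key):
        return self.counts.get(key, 0)

    def heavy_hitters(self, threshold):
        return {k for k, c in self.counts.items() if c >= threshold}


def route_probe_tuple(key, sketch, num_servers, local_server, threshold):
    """Pick the target server for one probe-side tuple.

    Hypothetical rule: once a key's estimated count reaches `threshold`,
    its tuples stay on the local server; all other tuples go to their
    hash partition (integer keys assumed, so `key % num_servers`).
    """
    sketch.add(key)
    if sketch.estimate(key) >= threshold:
        return local_server
    return key % num_servers
```

Because detection and routing happen in the same pass over the input, there is no separate analysis phase; the first tuples of a skewed key are still hash-partitioned, and only the tuples seen after the threshold is crossed are rerouted.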
