Journal: 《计算机科学》 (Computer Science)

A Reducer Load Balancing Method for Hive Join Queries Based on ORC Metadata

Abstract

Load imbalance ranks first among the performance problems of large-scale MapReduce clusters, and Hive join queries trigger it very easily. A general solution is to design a reducer load-balancing partitioning algorithm guided by the key frequency distribution histogram estimated from the intermediate key-value pairs. Existing approaches to key histogram estimation rely on monitoring and sampling the map output in a distributed way, which generates heavy network traffic and notably delays the start of the shuffle. For Hive join queries, this paper proposes a key histogram estimation method based on ORC metadata, together with a corresponding load-balancing key partitioning strategy. The method needs only light-weight computation before the job starts, so it imposes no extra load on network traffic and does not interfere with the existing shuffle mechanism. Benchmark tests show a significant improvement in the efficiency of key histogram estimation, and the corresponding key partitioning method improves Hive join query performance.
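To make the idea of histogram-guided partitioning concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a key frequency histogram is already available (in the paper it would be estimated from ORC metadata; here it is a hand-written dictionary) and assigns keys to reducers with a generic greedy heuristic: heaviest key first, always to the least-loaded reducer. The function names, the sample histogram, and the modulo baseline are all hypothetical and chosen only for illustration.

```python
import heapq


def greedy_partition(histogram, num_reducers):
    """Assign keys to reducers, heaviest key first, to the least-loaded reducer.

    Generic greedy heuristic used for illustration only; it is an assumption,
    not the partitioning strategy proposed in the paper.
    """
    # Min-heap of (current_load, reducer_id), so the lightest reducer pops first.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending frequency; break ties by key for determinism.
    for key, freq in sorted(histogram.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + freq, reducer))
    return assignment


def modulo_partition(histogram, num_reducers):
    """Baseline that ignores frequencies, standing in for hash partitioning."""
    return {key: key % num_reducers for key in histogram}


def reducer_loads(histogram, assignment, num_reducers):
    """Total number of records each reducer would receive."""
    loads = [0] * num_reducers
    for key, freq in histogram.items():
        loads[assignment[key]] += freq
    return loads


# Hypothetical skewed join-key histogram: key -> estimated record count.
histogram = {0: 900, 1: 50, 2: 40, 3: 300, 4: 30, 5: 20, 6: 600, 7: 60}
n = 3

balanced = reducer_loads(histogram, greedy_partition(histogram, n), n)
baseline = reducer_loads(histogram, modulo_partition(histogram, n), n)

print("baseline loads:", baseline)
print("balanced loads:", balanced)
```

On this sample the frequency-blind baseline sends the three heaviest keys to the same reducer, while the greedy assignment spreads them out, so the most heavily loaded reducer handles half as many records. The quality of any such assignment depends on how accurate the histogram is, which is why the estimation step matters.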
