Journal: IEEE Transactions on Parallel and Distributed Systems

A Comprehensive Study of MapReduce Over Lustre for Intermediate Data Placement and Shuffle Strategies on HPC Clusters


Abstract

With high-performance interconnects and parallel file systems, running MapReduce over modern High Performance Computing (HPC) clusters has attracted much attention because it combines Big Data and HPC technologies to solve data analytics problems. Since the MapReduce architecture relies heavily on the availability of local storage media, the Lustre-based global storage in HPC clusters poses many new opportunities and challenges. In this paper, we perform a comprehensive study of different MapReduce over Lustre deployments and propose a novel high-performance design of YARN MapReduce on HPC clusters that utilizes Lustre as an additional storage provider for intermediate data. For a deployment architecture in which both local disks and Lustre are used for intermediate data storage, we propose a novel priority directory selection scheme through which RDMA-enhanced MapReduce can choose the best intermediate storage at runtime by on-line profiling. Our results indicate that we can achieve a 44 percent performance benefit for shuffle-intensive workloads on leadership-class HPC systems. Our priority directory selection scheme can improve job execution time by 63 percent over default MapReduce while executing multiple concurrent jobs. To the best of our knowledge, this is the first such comprehensive study of YARN MapReduce with Lustre and RDMA.
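The abstract does not spell out how the priority directory selection scheme works. Purely as an illustration of the general idea of picking an intermediate-data directory by on-line profiling, a minimal Python sketch follows. The function names, the probe size, and the use of a single write-throughput probe as the only criterion are all assumptions made for this example and are not taken from the paper.

```python
import os
import tempfile
import time


def measure_write_throughput(directory, probe_bytes=1 << 20):
    """Write a small probe file and return observed throughput in bytes/second.

    Hypothetical helper: the probe size and the metric are illustrative
    assumptions, not the paper's profiling method.
    """
    payload = b"\0" * probe_bytes
    start = time.perf_counter()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the data out to the storage layer
    finally:
        os.remove(path)  # never leave the probe file behind
    elapsed = time.perf_counter() - start
    return probe_bytes / max(elapsed, 1e-9)


def pick_directory(candidates, probe=measure_write_throughput):
    """Return (best_directory, scores), preferring the highest probed throughput.

    `candidates` would hold, for example, a local-disk path and a Lustre path;
    `probe` is injectable so the selection logic can be exercised without
    touching real storage.
    """
    scores = {d: probe(d) for d in candidates}
    best = max(scores, key=scores.get)
    return best, scores
```

A caller could re-run `pick_directory` periodically during a job so that the choice tracks changing load on each storage tier; how often to re-profile, and which metrics to weigh, are design decisions the abstract leaves open.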
