Conference: Supercomputing Frontiers

On the Performance of Spark on HPC Systems: Towards a Complete Picture


Abstract

Big Data analytics frameworks (e.g., Apache Hadoop and Apache Spark) are increasingly used by companies and research labs to facilitate large-scale data analysis. However, as user needs and data sizes grow, commodity-based infrastructure will strain under the weight of Big Data. HPC systems, on the other hand, offer a rich set of opportunities for Big Data processing. As first steps toward Big Data processing on HPC systems, several research efforts have been devoted to understanding the performance of Big Data applications on these systems. Yet the HPC-specific performance considerations have not been fully investigated. In this work, we conduct an experimental campaign to provide a clearer understanding of the performance of Spark, the de facto in-memory data processing framework, on HPC systems. We ran Spark with representative Big Data workloads on the Grid'5000 testbed to evaluate how latency, contention, and file system configuration influence application performance. We discuss the implications of our findings and draw attention to new ways (e.g., burst buffers) to improve the performance of Spark on HPC systems.
