Conference: IEEE International Parallel and Distributed Processing Symposium (IPDPS)

XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures

Abstract

Most recent HPC platforms have heterogeneous nodes composed of multi-core CPUs and accelerators such as GPUs. Programming such nodes is typically based on a combination of OpenMP and CUDA/OpenCL codes, with scheduling that relies on static partitioning and a cost model. We present the XKaapi runtime system for multi-CPU and multi-GPU architectures, which supports a data-flow task model and a locality-aware work-stealing scheduler. XKaapi enables task multi-implementation on CPU or GPU and multi-level parallelism with different grain sizes. We show performance results on two dense linear algebra kernels, matrix product (GEMM) and Cholesky factorization (POTRF), to evaluate XKaapi on a heterogeneous architecture composed of two hexa-core CPUs and eight NVIDIA Fermi GPUs. Our conclusion is two-fold. First, fine-grained parallelism and online scheduling achieve performance results as good as static strategies, and in most cases outperform them. This is due to an improved work-stealing strategy that includes locality information, a very lightweight implementation of tasks in XKaapi, and an optimized search for ready tasks. Second, the multi-level parallelism on multiple CPUs and GPUs enabled by XKaapi leads to a highly efficient Cholesky factorization. Using eight NVIDIA Fermi GPUs and four CPUs, we measure up to 2.43 TFlop/s on double-precision matrix product and 1.79 TFlop/s on Cholesky factorization, and respectively 5.09 TFlop/s and 3.92 TFlop/s in single precision.
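As a rough illustration of what a data-flow task model means in practice, the following Python sketch infers task dependencies from declared read and write accesses on the tiles of a tiled Cholesky factorization. It is not XKaapi's API and does not model its work-stealing scheduler or GPU support; the function names (build_dataflow_graph, tiled_cholesky_tasks) and the tile-indexing scheme are assumptions made up for this example.

```python
from collections import defaultdict


def build_dataflow_graph(tasks):
    """Infer dependencies from data accesses (hypothetical helper).

    tasks: list of (name, reads, writes) in sequential program order.
    Returns a dict mapping each task index to the set of task indices
    that must complete before it can run.
    """
    last_writer = {}                         # datum -> index of last writing task
    readers_since_write = defaultdict(list)  # datum -> readers since that write
    deps = {i: set() for i in range(len(tasks))}
    for i, (_name, reads, writes) in enumerate(tasks):
        for d in reads:
            if d in last_writer:
                deps[i].add(last_writer[d])          # read-after-write
        for d in writes:
            if d in last_writer:
                deps[i].add(last_writer[d])          # write-after-write
            deps[i].update(readers_since_write[d])   # write-after-read
        for d in reads:
            readers_since_write[d].append(i)
        for d in writes:
            last_writer[d] = i
            readers_since_write[d] = []
        deps[i].discard(i)
    return deps


def tiled_cholesky_tasks(nt):
    """Task list of a right-looking tiled Cholesky on an nt x nt tile grid.

    Each tile is identified by its (row, column) index; kernels that
    update a tile in place list it under both reads and writes.
    """
    tasks = []
    for k in range(nt):
        tasks.append((f"POTRF({k})", [(k, k)], [(k, k)]))
        for i in range(k + 1, nt):
            tasks.append((f"TRSM({i},{k})", [(k, k), (i, k)], [(i, k)]))
        for i in range(k + 1, nt):
            tasks.append((f"SYRK({i},{k})", [(i, k), (i, i)], [(i, i)]))
            for j in range(k + 1, i):
                tasks.append(
                    (f"GEMM({i},{j},{k})", [(i, k), (j, k), (i, j)], [(i, j)])
                )
    return tasks


if __name__ == "__main__":
    tasks = tiled_cholesky_tasks(3)
    deps = build_dataflow_graph(tasks)
    for idx, (name, _reads, _writes) in enumerate(tasks):
        preds = sorted(tasks[p][0] for p in deps[idx])
        print(name, "waits for", preds)
```

Tasks whose predecessor sets are disjoint from one another's results, such as the TRSM updates that follow the same POTRF, can run concurrently; a runtime scheduler is then free to place them on any available CPU core or GPU.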
