Journal: IEEE Transactions on Parallel and Distributed Systems

LU Factorization with Partial Pivoting for a Multicore System with Accelerators

Abstract

LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This paper presents an implementation of the algorithm for a hybrid shared-memory system with standard CPU cores and GPU accelerators. The difficulty of implementing the algorithm for such a system lies in the disparity between the computational power of the CPUs and that of the GPUs, and in the meager bandwidth of the communication link between their memory systems. An additional challenge comes from the memory-bound and synchronization-rich nature of the panel factorization component of the block LU algorithm, which is imposed by the use of partial pivoting. The challenges are tackled with a data layout geared toward complex memory hierarchies, autotuning of GPU kernels, fine-grain parallelization of memory-bound CPU operations, and dynamic scheduling of tasks to different devices. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
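To make the structure of the block LU algorithm mentioned in the abstract concrete, the following is a minimal, sequential sketch in pure Python. It is not the paper's implementation: it uses no GPU kernels, no special data layout, and no dynamic scheduling. The function names (`lu_partial_pivot_blocked`, `factorization_error`) and the block size parameter `nb` are assumptions chosen for illustration. The sketch only shows the three phases of each block step: the panel factorization, where every column requires a pivot search and a row swap before elimination can proceed, followed by a triangular solve and a trailing-matrix update.

```python
def lu_partial_pivot_blocked(a, nb=2):
    """Right-looking block LU with partial pivoting (illustrative sketch).

    `a` is a square matrix given as a list of lists. Returns (lu, piv):
    lu stores the unit-lower factor L below the diagonal and U on and above
    it; piv[k] is the row that was swapped with row k at step k.
    """
    n = len(a)
    lu = [row[:] for row in a]  # work on a copy
    piv = list(range(n))
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # Phase 1: panel factorization of columns k0..k1-1.
        for k in range(k0, k1):
            # Pivot search: largest-magnitude entry at or below the diagonal.
            p = max(range(k, n), key=lambda i: abs(lu[i][k]))
            if lu[p][k] == 0.0:
                raise ValueError("matrix is singular")
            piv[k] = p
            if p != k:
                lu[k], lu[p] = lu[p], lu[k]  # swaps the entire rows
            for i in range(k + 1, n):
                lu[i][k] /= lu[k][k]
                # Eliminate only within the remaining panel columns.
                for j in range(k + 1, k1):
                    lu[i][j] -= lu[i][k] * lu[k][j]
        # Phase 2: triangular solve, U12 = inv(L11) * A12.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    lu[i][j] -= lu[i][k] * lu[k][j]
        # Phase 3: trailing update, A22 = A22 - L21 * U12.
        for i in range(k1, n):
            for k in range(k0, k1):
                lik = lu[i][k]
                for j in range(k1, n):
                    lu[i][j] -= lik * lu[k][j]
    return lu, piv


def factorization_error(a, lu, piv):
    """Return the largest absolute entry of P*A - L*U."""
    n = len(a)
    pa = [row[:] for row in a]
    for k, p in enumerate(piv):  # replay the recorded row swaps in order
        pa[k], pa[p] = pa[p], pa[k]
    err = 0.0
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]  # unit diagonal of L
                s += l_ik * lu[k][j]
            err = max(err, abs(pa[i][j] - s))
    return err


A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
LU, PIV = lu_partial_pivot_blocked(A, nb=2)
ERR = factorization_error(A, LU, PIV)
```

In this sketch the panel loop is the part that resists parallelization: each column step depends on the pivot chosen in that step, while the trailing update is a plain matrix-matrix product with no such dependency. This is one way to read the abstract's contrast between the synchronization-rich panel factorization and the rest of the algorithm.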