Journal of Parallel and Distributed Computing

Efficient heterogeneous execution on large multicore and accelerator platforms: Case study using a block tridiagonal solver

Abstract

The algorithmic and implementation principles are explored in gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes, and evaluated in an implementation of a scalable block tridiagonal solver. The accelerator of each compute node is exploited in combination with multicore processors of that node in performing block-level linear algebra operations in the overall, distributed solver algorithm. Optimizations incorporated include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multiprocess sharing of the accelerator within a node to obtain balanced load with multicore processors, and (3) an automatic memory management system to efficiently utilize accelerator memory when sub-matrices spill over the limits of device memory. Results are reported from our novel implementation that uses MAGMA and CUBLAS accelerator software systems simultaneously with ACML (2013) [2] for multithreaded execution on processors. Overall, using 940 nVidia Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline implementation that is highly optimized with intra-node and inter-node concurrency and computation-communication overlap. Detailed quantitative results are presented to explain all critical runtime components contributing to hybrid performance.