Journal: IEEE Transactions on Parallel and Distributed Systems

GPU Acceleration of Runge-Kutta Integrators



Abstract

We consider the use of commodity graphics processing units (GPUs) for the common task of numerically integrating ordinary differential equations (ODEs), achieving speedups of up to 115-fold over comparable serial CPU implementations, and 15-fold over multithreaded CPU code with SIMD intrinsics. Using Lorenz '96 models as a case study, single and double precision benchmarks are established for both the widely used DOPRI5 method and computationally tailored low-storage RK4(3)5[2R+]C. A range of configurations are assessed on each, including multithreading and SIMD intrinsics on the CPU, and GPU kernels parallelized over both the dimensionality of the ODE system and number of trajectories. On the GPU, we draw particular attention to the problem of variable task-length among threads of the same warp, proposing a lightweight strategy of assigning multiple data items to each thread to reduce the prevalence of redundant operations. A simple analysis suggests that the strategy can draw performance close to that of ideal parallelism, while empirical results demonstrate up to a 10 percent improvement over the standard approach.
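The variable task-length problem mentioned in the abstract can be illustrated with a toy cost model. The sketch below is not the paper's analysis or implementation: it assumes that all threads of a warp advance in lockstep, so a warp is busy until its slowest thread finishes, and that a thread holding several data items works through them one after another. The names `warp_steps` and `ideal_steps`, the strided assignment of items to threads, and the warp size of 32 are illustrative assumptions.

```python
WARP_SIZE = 32  # assumed number of threads that execute in lockstep


def warp_steps(task_lengths, items_per_thread=1, warp_size=WARP_SIZE):
    """Lockstep iterations needed when each thread handles
    `items_per_thread` tasks in sequence (toy model, hypothetical)."""
    per_warp = warp_size * items_per_thread
    total = 0
    for start in range(0, len(task_lengths), per_warp):
        chunk = task_lengths[start:start + per_warp]
        # Thread t takes items t, t + warp_size, t + 2 * warp_size, ...
        thread_totals = [sum(chunk[t::warp_size]) for t in range(warp_size)]
        # The warp is occupied until its slowest thread is done.
        total += max(thread_totals)
    return total


def ideal_steps(task_lengths, warp_size=WARP_SIZE):
    """Lower bound: the work spread perfectly evenly over the threads."""
    return sum(task_lengths) / warp_size


# Four tasks on a hypothetical 2-thread warp.
lengths = [1, 3, 3, 1]
print(warp_steps(lengths, items_per_thread=1, warp_size=2))  # 6
print(warp_steps(lengths, items_per_thread=2, warp_size=2))  # 4
print(ideal_steps(lengths, warp_size=2))                     # 4.0
```

With one item per thread, each warp pays for its longest task, and the shorter tasks leave their threads idle. With several items per thread, long and short tasks can offset each other within a thread, so the warp's cost moves toward the ideal bound. In this model the multi-item cost is never worse than the one-item cost and never better than the ideal.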
