Parallel Processing Workshops, 2009 (ICPPW '09)

CUDA Memory Optimizations for Large Data-Structures in the Gravit Simulator

Abstract

Modern GPUs open a completely new field for optimizing embarrassingly parallel algorithms. Implementing an algorithm on a GPU confronts the programmer with a new set of optimization challenges. Some of the most notable ones are isolating the part of the algorithm that can be optimized to run on the GPU; tuning the program for the GPU memory hierarchy, whose organization and performance implications are radically different from those of general-purpose CPUs; and optimizing the program at the instruction level for the GPU. This paper makes two contributions to performance optimization for GPUs. We analyze different approaches to optimizing memory usage and access patterns for GPUs and propose a class of memory layout optimizations that take full advantage of the unique memory hierarchy of NVIDIA CUDA. Furthermore, we analyze the performance increase obtained by fully unrolling the innermost loop of the algorithm and propose guidelines on how to best unroll a program for the GPU. In particular, even though loop unrolling is a common optimization, on a GPU the performance improvement derives from a completely different aspect of the architecture. To demonstrate these optimizations, we picked an embarrassingly parallel algorithm used to calculate gravitational forces. This algorithm allows us to demonstrate and explain the performance increase gained by the applied optimizations. Our results show that our approach is quite effective. After applying our technique to the algorithm used in the Gravit gravity simulator, we observed a 1.27x speedup compared to the baseline GPU implementation. This represents an 87x speedup over the original CPU implementation.
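The abstract itself contains no code. The sketch below is only an illustration of the kind of kernel these two optimizations target, assuming a tiled N-body force calculation: body data packed as float4 and staged tile-by-tile in shared memory to suit CUDA's memory hierarchy, and an innermost loop over the tile that is fully unrolled. The kernel name gravityForces, the TILE size, the SOFTENING constant, and the float4 packing are illustrative assumptions, not details taken from the paper.

```cuda
// Hypothetical sketch (not the paper's code): a tiled N-body force kernel
// illustrating shared-memory staging of body data and full unrolling of the
// innermost loop over a tile.

#include <cuda_runtime.h>

#define TILE 128          // assumed thread-block size / tile width
#define SOFTENING 1e-9f   // assumed softening term to avoid division by zero

// Positions and masses are packed as float4 (x, y, z, mass) so that one
// coalesced 16-byte load fetches a whole body.
__global__ void gravityForces(const float4* __restrict__ bodies,
                              float4* __restrict__ accel,
                              int n)
{
    __shared__ float4 tile[TILE];   // one tile of bodies staged in shared memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 bi = (i < n) ? bodies[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float3 ai = make_float3(0.f, 0.f, 0.f);

    for (int base = 0; base < n; base += TILE) {
        // Each thread loads one body of the current tile (coalesced access).
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? bodies[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();

        // Fully unroll the innermost loop: the tile size is a compile-time
        // constant, so the compiler can eliminate loop overhead and keep
        // intermediate values in registers.
        #pragma unroll
        for (int k = 0; k < TILE; ++k) {
            float4 bj = tile[k];
            float3 r = make_float3(bj.x - bi.x, bj.y - bi.y, bj.z - bi.z);
            float distSqr = r.x * r.x + r.y * r.y + r.z * r.z + SOFTENING;
            float invDist = rsqrtf(distSqr);
            float s = bj.w * invDist * invDist * invDist;  // m_j / |r|^3
            ai.x += r.x * s;
            ai.y += r.y * s;
            ai.z += r.z * s;
        }
        __syncthreads();
    }

    if (i < n)
        accel[i] = make_float4(ai.x, ai.y, ai.z, 0.f);
}
```

Such a kernel would be launched with one thread per body and a block size equal to the tile width, e.g. gravityForces<<<(n + TILE - 1) / TILE, TILE>>>(d_bodies, d_accel, n); padded slots carry zero mass so they contribute no force.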
