Journal of Parallel and Distributed Computing

Locality optimized unstructured mesh algorithms on GPUs


Abstract

Unstructured-mesh based numerical algorithms, such as finite volume and finite element algorithms, form an important class of applications for many scientific and engineering domains. The key difficulty in achieving higher performance from these applications is the indirect accesses that lead to data races when parallelized. Current methods for handling such data races lead to reduced parallelism and suboptimal performance. Particularly on modern many-core architectures such as GPUs, which have increasing core/thread counts, reducing data movement and exploiting memory locality is vital for gaining good performance. In this work we present novel locality-exploiting optimizations for the efficient execution of unstructured-mesh algorithms on GPUs. Building on a two-layered coloring strategy for handling data races, we introduce novel reordering and partitioning techniques to further improve execution efficiency. The new optimizations are then applied to several well-established unstructured-mesh applications, investigating their performance on NVIDIA's latest P100 and V100 GPUs. We demonstrate significant speedups (1.1-1.75x) compared to the state of the art. A range of performance metrics are benchmarked, including runtime, memory transactions, achieved bandwidth, GPU occupancy and data reuse factors, and are used to understand and explain the key factors impacting performance. The optimized algorithms are implemented as an open-source software library, and we illustrate its use for improving the performance of existing or new unstructured-mesh applications. © 2019 Elsevier Inc. All rights reserved.
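The data-race problem mentioned in the abstract can be illustrated with a small sketch. When a loop over mesh edges increments values stored on nodes, two edges that share a node must not be processed at the same time. Coloring the edges so that no two edges of one color share a node makes each color safe to run in parallel. The abstract does not describe the authors' two-layered scheme in detail, so the code below shows only the basic single-level idea with a greedy coloring; the function names `color_edges` and `accumulate_by_color` are hypothetical and are not taken from the authors' library.

```python
def color_edges(edges):
    """Greedily give each edge the smallest color unused at both endpoints.

    Assumption for illustration: edges are (node_a, node_b) index pairs.
    """
    node_colors = {}  # node index -> set of colors already used at that node
    colors = []
    for a, b in edges:
        used = node_colors.setdefault(a, set()) | node_colors.setdefault(b, set())
        c = 0
        while c in used:
            c += 1
        colors.append(c)
        node_colors[a].add(c)
        node_colors[b].add(c)
    return colors


def accumulate_by_color(edges, edge_values, num_nodes):
    """Add each edge value to both of its nodes, one color at a time.

    Edges of one color touch disjoint nodes, so the inner loop could run
    in parallel without atomics. Here it runs serially for clarity.
    """
    colors = color_edges(edges)
    result = [0.0] * num_nodes
    for c in range(max(colors, default=-1) + 1):
        for (a, b), value, edge_color in zip(edges, edge_values, colors):
            if edge_color == c:
                result[a] += value
                result[b] += value
    return result


if __name__ == "__main__":
    # A square (4 nodes) with one diagonal: 5 edges.
    mesh_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    print(color_edges(mesh_edges))
    print(accumulate_by_color(mesh_edges, [1.0] * 5, 4))
```

The cost of this approach is visible even in the toy example: edges that are adjacent in memory end up in different colors, which spreads accesses out and reduces data reuse. That loss of locality is the kind of overhead the reordering and partitioning techniques in the abstract aim to reduce.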
