International Journal of Parallel Programming

Efficient Processing of Large Data Structures on GPUs: Enumeration Scheme Based Optimisation


Abstract

The purpose of this paper is to highlight the performance issues of matrix transposition algorithms for large matrices, relating to the Translation Lookaside Buffer (TLB) cache. The existing optimisation techniques such as coalesced access and the use of shared memory, regardless of their necessity and benefits, are not sufficient to neutralise the problem. As the data problem size increases, these optimisations do not exploit data locality effectively enough to counteract the detrimental effects of TLB cache misses. We propose a new optimisation technique that counteracts the performance degradation of these algorithms and seamlessly complements current optimisations. Our optimisation is based on a detailed analysis of enumeration schemes that can be applied to either individual matrix entries or blocks (sub-matrices). The key advantage of these enumeration schemes is that they do not incur matrix storage format conversion because they operate on canonical matrix layouts. In addition, several cache-efficient matrix transposition algorithms based on enumeration schemes are offered: an improved version of the in-place algorithm for square matrices, an out-of-place algorithm for rectangular matrices, and two 3D involutions. We demonstrate that the choice of the enumeration schemes and their parametrisation can have a direct and significant impact on the algorithm’s memory access pattern. Our in-place version of the algorithm delivers up to a 100% performance improvement over the existing optimisation techniques. Meanwhile, for the out-of-place version we observe up to a 300% performance gain over NVIDIA’s algorithm. We also offer improved versions of two involution transpositions for 3D matrices that can achieve a performance increase of up to 300%. To the best of our knowledge, this is the first effective attempt to control the logical-to-physical block association through the design of enumeration schemes in the context of matrix transposition.
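
The abstract does not spell out the enumeration schemes themselves, but the mechanism it describes can be sketched. The CUDA kernels below are illustrative only: they assume single-precision, row-major matrices whose dimensions are multiples of the tile size, and they use a simple diagonal block reordering as a stand-in for the paper's schemes. They contrast the conventional coalesced, shared-memory tiled transpose with a variant in which the logical tile processed by a thread block is obtained by remapping the hardware-scheduled block indices rather than using them directly.

```cuda
#include <cuda_runtime.h>

#define TILE_DIM   32   // tile edge; one thread block transposes one tile
#define BLOCK_ROWS 8    // block shape is TILE_DIM x BLOCK_ROWS threads

// Conventional baseline: coalesced loads and stores through a shared-memory
// tile (padded by one column to avoid bank conflicts). Assumes width and
// height are multiples of TILE_DIM.
__global__ void transposeTiled(float *out, const float *in, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;   // transposed tile origin
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}

// Same data path, but the logical tile a block works on is derived from a
// remapping of the scheduled block indices. A diagonal reordering (square
// grid assumed) stands in here for the paper's enumeration schemes: changing
// only this mapping changes the order in which pages are touched, and hence
// the TLB behaviour, without converting the matrix storage format.
__global__ void transposeRemapped(float *out, const float *in, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int tileY = blockIdx.x;                              // logical tile row
    int tileX = (blockIdx.x + blockIdx.y) % gridDim.x;   // logical tile column

    int x = tileX * TILE_DIM + threadIdx.x;
    int y = tileY * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

    __syncthreads();

    x = tileY * TILE_DIM + threadIdx.x;
    y = tileX * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}
```

The two kernels differ only in how block indices are turned into tile coordinates; the matrix layout itself is untouched, which is the sense in which such schemes avoid storage format conversion.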
