International Journal of Parallel Programming

Efficient Processing of Large Data Structures on GPUs: Enumeration Scheme Based Optimisation


Abstract

The purpose of this paper is to highlight the performance issues of matrix transposition algorithms for large matrices, relating to the Translation Lookaside Buffer (TLB) cache. The existing optimisation techniques such as coalesced access and the use of shared memory, regardless of their necessity and benefits, are not sufficient to neutralise the problem. As the data problem size increases, these optimisations do not exploit data locality effectively enough to counteract the detrimental effects of TLB cache misses. We propose a new optimisation technique that counteracts the performance degradation of these algorithms and seamlessly complements current optimisations. Our optimisation is based on a detailed analysis of enumeration schemes that can be applied to either individual matrix entries or blocks (sub-matrices). The key advantage of these enumeration schemes is that they do not incur matrix storage format conversion because they operate on canonical matrix layouts. In addition, several cache-efficient matrix transposition algorithms based on enumeration schemes are offered: an improved version of the in-place algorithm for square matrices, an out-of-place algorithm for rectangular matrices, and two 3D involutions. We demonstrate that the choice of the enumeration schemes and their parametrisation can have a direct and significant impact on the algorithm’s memory access pattern. Our in-place version of the algorithm delivers up to a 100% performance improvement over the existing optimisation techniques. Meanwhile, for the out-of-place version we observe up to a 300% performance gain over NVIDIA’s algorithm. We also offer improved versions of two involution transpositions for 3D matrices that can achieve a performance increase of up to 300%. To the best of our knowledge, this is the first effective attempt to control the logical-to-physical block association through the design of enumeration schemes in the context of matrix transposition.
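
The abstract does not spell out the enumeration schemes themselves, but the mechanism it describes can be sketched. The CUDA kernels below are illustrative only: they assume single-precision, row-major matrices whose dimensions are multiples of the tile size, and they use a simple diagonal block reordering as a stand-in for the paper's schemes. They contrast the conventional coalesced, shared-memory tiled transpose with a variant in which the logical tile processed by a thread block is obtained by remapping the hardware-scheduled block indices rather than using them directly.

```cuda
#include <cuda_runtime.h>

#define TILE_DIM   32   // tile edge; one thread block transposes one tile
#define BLOCK_ROWS 8    // block shape is TILE_DIM x BLOCK_ROWS threads

// Conventional baseline: coalesced loads and stores through a shared-memory
// tile (padded by one column to avoid bank conflicts). Assumes width and
// height are multiples of TILE_DIM.
__global__ void transposeTiled(float *out, const float *in, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;   // transposed tile origin
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}

// Same data path, but the logical tile a block works on is derived from a
// remapping of the scheduled block indices. A diagonal reordering (square
// grid assumed) stands in here for the paper's enumeration schemes: changing
// only this mapping changes the order in which pages are touched, and hence
// the TLB behaviour, without converting the matrix storage format.
__global__ void transposeRemapped(float *out, const float *in, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int tileY = blockIdx.x;                              // logical tile row
    int tileX = (blockIdx.x + blockIdx.y) % gridDim.x;   // logical tile column

    int x = tileX * TILE_DIM + threadIdx.x;
    int y = tileY * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

    __syncthreads();

    x = tileY * TILE_DIM + threadIdx.x;
    y = tileX * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}
```

The two kernels differ only in how block indices are turned into tile coordinates; the matrix layout itself is untouched, which is the sense in which such schemes avoid storage format conversion.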
