
Data layout transformation through in-place transposition.



Abstract

Matrix transposition is an important algorithmic building block for many numeric algorithms such as multidimensional FFT. It has also been used to convert the storage layout of arrays. Intuitively, in-place transposition should be a good fit for GPU architectures due to their limited on-board memory capacity and high throughput. However, direct application of in-place transposition algorithms from the CPU lacks the parallelism and locality required by GPUs to achieve good performance.

In this thesis we present the first known in-place matrix transposition approach for GPUs. Our implementation is based on a staged transposition algorithm in which each stage is performed using an elementary tile-wise transposition. With both low-level optimizations to the elementary tile-wise transpositions and high-level improvements to the existing staged transposition algorithm, our design is able to reach more than 20 GB/s of sustained throughput on modern GPUs, and a 3X speedup.

Furthermore, for many-core architectures such as GPUs, efficient off-chip memory access is crucial to high performance; applications are often limited by off-chip memory bandwidth. Transforming data layout is an effective way to reshape access patterns and improve off-chip memory access behavior, but several challenges have limited the use of automated data layout transformation systems on GPUs, namely how to efficiently handle arrays of aggregates, and how to transparently marshal data between the layouts required by different performance-sensitive kernels and by legacy host code. While GPUs have higher memory bandwidth and are natural candidates for marshaling data between layouts, their relatively constrained memory capacity, compared to that of the CPU, implies that not only the temporal cost of marshaling but also its spatial overhead must be considered by any practical layout transformation system.

As an application of the in-place transposition methodology, a novel approach to laying out arrays of aggregate types across GPU and CPU architectures is proposed to further improve memory parallelism and kernel performance beyond what is achieved by human programmers using discrete arrays today. Second, the system, DL, has a run-time library implemented in OpenCL that transparently and efficiently converts, or marshals, data to accommodate application components that have different data layout requirements. We present the insights that led to the design of this highly efficient run-time marshaling library. Third, we show experimental results demonstrating that the new layout approach leads to substantial performance improvement at the application level, even when all marshaling costs are taken into account.
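As context for the elementary building block mentioned in the abstract, the sketch below shows a standard shared-memory tile-wise transpose kernel in CUDA. It is illustrative only: it writes out of place and is not the thesis's staged in-place algorithm; the tile dimensions, kernel name, and launch configuration are assumptions.

```cuda
// Illustrative sketch: a conventional shared-memory tile transpose kernel of
// the kind used as an elementary building block for staged transposition.
// This version is out of place; the thesis composes tile operations so that
// the transposition happens within a single buffer.
#include <cuda_runtime.h>

#define TILE_DIM   32   // assumed tile size
#define BLOCK_ROWS 8    // each thread block covers TILE_DIM x BLOCK_ROWS threads

__global__ void transpose_tile(float *out, const float *in,
                               int width, int height)
{
    // +1 padding avoids shared-memory bank conflicts on the column reads below.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Coalesced load of one tile from the source matrix (row-major, width columns).
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < width && (y + j) < height)
            tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

    __syncthreads();

    // Coalesced store of the transposed tile; block indices are swapped so the
    // tile lands at its mirrored position in the output (height columns).
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < height && (y + j) < width)
            out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}
```

In a staged in-place scheme, operations of this kind are combined so that tiles are exchanged and transposed within the same buffer rather than copied to a separate output array, which is what makes the approach attractive given limited GPU memory.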
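The layout problem described in the second half of the abstract can be made concrete with a small sketch contrasting an array of aggregates (the form legacy host code typically uses) with the discrete-array layout that GPU kernels favor, plus a naive marshaling loop between them. The Particle type, field names, and function names here are hypothetical; the thesis's DL library performs this kind of marshaling transparently and in place on the GPU.

```cuda
// Illustrative sketch of the two layouts being marshaled between.
#include <cstddef>

struct Particle { float x, y, z, m; };   // aggregate type used by host code (AoS)

struct ParticleArrays {                  // discrete-array (SoA) form for GPU kernels
    float *x, *y, *z, *m;
};

// Naive out-of-place marshal on the host. The thesis's point is that marshaling
// in place on the GPU avoids both this copy and the doubled memory footprint,
// which matters given the GPU's limited memory capacity.
void marshal_aos_to_soa(const Particle *aos, ParticleArrays soa, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        soa.x[i] = aos[i].x;
        soa.y[i] = aos[i].y;
        soa.z[i] = aos[i].z;
        soa.m[i] = aos[i].m;
    }
}

// With discrete arrays, consecutive threads touch consecutive floats of the
// same field array, so each warp issues fully coalesced memory accesses.
__global__ void scale_masses(ParticleArrays p, float s, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.m[i] *= s;
}
```

The same kernel written against the aggregate layout would stride through memory by sizeof(Particle) per thread, wasting off-chip bandwidth, which is why a run-time marshaling library is useful when different components require different layouts.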

Record details

  • Author: Sung, I-Jui
  • Author affiliation: University of Illinois at Urbana-Champaign
  • Degree-granting institution: University of Illinois at Urbana-Champaign
  • Subject: Computer Engineering
  • Degree: Ph.D.
  • Year: 2013
  • Pages: 118 p.
  • Total pages: 118
  • Format: PDF
  • Language: English
