
Compiler optimizations for SIMD/GPU/multicore architectures.


Abstract

In modern computer architectures, both SIMD (single-instruction multiple-data) instruction set extensions and GPUs can be used to accelerate general-purpose applications. In addition, multicore machines, with their increasing core counts and deeper cache hierarchies, can potentially provide more computational power for high-performance computing. However, manually writing high-performance code for these architectures is still tedious and difficult, and in particular the unique characteristics of these architectures may not be fully exploited.

Specifically, SIMD instruction set extensions enable the exploitation of a specific type of data parallelism called SLP (Superword Level Parallelism). While prior research shows that significant performance gains are possible when SLP is exploited, manually placing SIMD instructions in application code can be very difficult and error-prone. We propose a novel automated compiler framework for improving superword level parallelism exploitation. The key part of our framework consists of two stages: superword statement generation and data layout optimization. The first stage, our main contribution, has two phases, statement grouping and statement scheduling, whose primary goals are to increase SIMD parallelism and, more importantly, to capture more superword reuses among the superword statements through global data access and reuse pattern analysis. Further, as a complementary optimization, our data layout optimization organizes data in memory such that the cost of the memory operations needed for SLP is minimized. The results from our compiler implementation, tested on two systems, indicate performance improvements as high as 15.2% over a state-of-the-art SLP optimization algorithm.

On the other hand, GPUs are also increasingly being used to accelerate general-purpose applications, leading to the emergence of GPGPU architectures. New programming models, e.g., the Compute Unified Device Architecture (CUDA), have been proposed to facilitate programming general-purpose computations on GPGPUs. However, manually writing high-performance CUDA code is still tedious and difficult. In particular, the organization of data in the memory space can greatly affect performance due to the unique features of the GPGPU memory hierarchy. In this work, we propose an automatic data layout transformation framework to address the key issues associated with the GPGPU memory hierarchy (i.e., channel skewing, data coalescing, and bank conflicts). Our approach employs a widely applicable strategy based on a novel concept called data localization. Specifically, we optimize the layout of the arrays accessed in kernels mapped to GPGPUs, for both device memory and shared memory, at both coarse-grain and fine-grain parallelization levels.

In addition, iteration space tiling is an important technique for optimizing loops, which constitute a large fraction of the execution time of the computation kernels in both scientific codes and embedded applications. While tiling has been studied extensively on both uniprocessor and multiprocessor platforms, prior research has paid less attention to tile scheduling, especially when targeting multicore machines with deep on-chip cache hierarchies. We propose a cache hierarchy-aware tile scheduling algorithm for multicore machines that aims to maximize both horizontal and vertical data reuse in on-chip caches while balancing the workload across cores. This scheduling algorithm is one of the key components of a source-to-source translation tool that we developed for automatic loop parallelization and multithreaded code generation from sequential code. To the best of our knowledge, this is the first effort to develop a fully automated tile scheduling strategy customized for the on-chip cache topologies of multicore machines.
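To make the SLP idea above concrete, here is a minimal hand-written sketch of the kind of transformation such a framework automates: four isomorphic scalar statements are grouped into one superword statement using SSE intrinsics. The grouping here is trivial and hand-picked for illustration; the dissertation's statement grouping and scheduling phases, which rely on global data access and reuse pattern analysis, are not reproduced.

```cuda
#include <xmmintrin.h>  // SSE intrinsics (x86 host code)

// Scalar version: four isomorphic statements per iteration that an SLP
// pass can group into a single superword statement.
void scalar_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {   // assumes n % 4 == 0
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
}

// SLP-style version: the four grouped statements become one SIMD add.
// Each loaded __m128 is a superword; capturing reuse of such superwords
// across statements is what the framework's scheduling phase maximizes.
void slp_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   // unaligned-safe loads
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
}
```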
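The effect of data layout on coalescing, one of the issues the GPGPU layout framework targets, can be illustrated with the standard array-of-structures versus structure-of-arrays contrast in CUDA. The kernels and names below are illustrative assumptions with a simple 1-D thread mapping; they do not show the dissertation's data localization analysis, channel skewing, or bank-conflict handling.

```cuda
#include <cuda_runtime.h>

#define N 1024

// Array-of-structures layout: thread t reads p[t].x, so adjacent threads
// touch addresses 12 bytes apart and a warp's loads do not coalesce well.
struct Point { float x, y, z; };

__global__ void scale_aos(Point *p, float s) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < N) p[t].x *= s;
}

// Structure-of-arrays layout after a layout transformation: adjacent
// threads read adjacent floats, so each warp's loads coalesce into a
// minimal number of device-memory transactions.
__global__ void scale_soa(float *x, float s) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < N) x[t] *= s;
}
```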
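For the tiling discussion, the following host-side sketch shows plain iteration-space tiling of a loop nest with a transpose-like access pattern, where tiling keeps both row and column accesses cache-resident. The tile size T is a placeholder that would be chosen to fit a target cache level, and the static row-major tile order is only illustrative: deciding which tiles run on which cores, so that tiles sharing data land on cores under the same shared cache, is precisely what the hierarchy-aware tile scheduler decides, and that policy is not reproduced here.

```cuda
// Plain C/C++ host code: iteration-space tiling of a 2-D loop nest.
#define T 64   // placeholder tile size, tuned to a cache level in practice

void tiled_transpose_add(const float *A, float *B, int n) {
    for (int ii = 0; ii < n; ii += T)            // tile loops: visit one
        for (int jj = 0; jj < n; jj += T)        // T x T tile at a time
            for (int i = ii; i < ii + T && i < n; i++)     // intra-tile loops
                for (int j = jj; j < jj + T && j < n; j++)
                    // A is read both row-wise and column-wise; tiling keeps
                    // the reused column data in cache across iterations.
                    B[(size_t)i * n + j] =
                        A[(size_t)i * n + j] + A[(size_t)j * n + i];
}
```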

Bibliographic Information

  • Author

    Liu, Jun.

  • Affiliation

    The Pennsylvania State University.

  • Degree-Granting Institution: The Pennsylvania State University.
  • Subject: Computer Science; Computer Engineering.
  • Degree: Ph.D.
  • Year: 2013
  • Pages: 99 p.
  • Total Pages: 99
  • Format: PDF
  • Language: English (eng)
