Hardware Support for Productive Partitioned Global Address Space (PGAS) Programming.


Abstract

In order to exploit the increasing number of transistors, and due to the limitations of frequency scaling, the number of cores inside a chip keeps growing. As many-core chips become ubiquitous, there is a greater need for a more productive and efficient parallel programming model. The easy-to-use, but locality-agnostic, shared memory model (e.g. OpenMP) is unable to efficiently exploit memory locality in systems with Non-Uniform Memory Access (NUMA) and Non-Uniform Cache Access (NUCA) effects. The locality-aware, but explicit, message-passing model (e.g. MPI) does not provide a productive development environment due to its two-sided communication and its distributed (and isolated) memory model.

The Partitioned Global Address Space (PGAS) programming model strikes a balance between these two extremes via a global address space that is provided for ease of use but is partitioned for locality awareness. The user-friendly PGAS memory model, however, comes at a performance cost: the required address mapping can hinder its potential for performance. To mitigate this overhead and achieve full performance, compiler optimizations may be applied, but they are often insufficient. Alternatively, manual optimizations can be applied, but they are cumbersome and, as such, unproductive. As a result, the overall benefit of PGAS has been severely limited.

In this dissertation, we improve both the productivity and performance of PGAS by introducing novel hardware support. This hardware efficiently handles the complex PGAS mapping and communication without the intervention of the application developer. Because the new hardware is introduced at the micro-architecture level, fine-grain, low-latency local shared memory accesses are supported. The hardware is also exposed through an ISA extension, so that PGAS compilers can easily exploit it to efficiently access and traverse the PGAS memory space.
The automatic code generation eliminates the need for hand-tuning and thus simultaneously improves both the performance and productivity of PGAS languages. This research also introduces and evaluates the possibility of the hardware support handling a variety of PGAS languages.

Results are obtained on two different system implementations. The first is based on the widely adopted full-system simulator Gem5, which allows a precise evaluation of the performance gain. Two prototype compilers supporting the new hardware are created for experimentation by extending the Berkeley Unified Parallel C (UPC) compiler and the Cray Chapel compiler. This allows unmodified code to use the new instructions without any user intervention, thereby creating a productive programming environment. The second, proof-of-concept implementation is a hardware prototype based on the multi-core Leon3 softcore processor running on a Virtex-6 FPGA. This allowed us not only to verify the feasibility of the implementation, but also to evaluate the cost of the new hardware and its instructions.

This research has shown very promising results. With benchmarks in UPC and Chapel, including the NAS Parallel Benchmarks implemented in UPC, a speedup of up to 5.5x is demonstrated when using the hardware support with unmodified code. Unmodified code using this hardware was also shown to surpass manually optimized UPC code in some cases by up to 10%. With Chapel, we obtained measurable speedups of up to 19x. Additionally, the hardware prototype demonstrated that only a very small area increase is needed.
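The address-mapping overhead the abstract refers to can be illustrated with a minimal Python sketch. It models UPC's default cyclic layout (blocksize 1), where element i of a shared array has affinity to thread i % THREADS; every remote-capable access must resolve a global index into an (owner thread, local address) pair. The function names and the base-address table are illustrative assumptions, not code or hardware details from the dissertation:

```python
THREADS = 4  # number of PGAS threads (assumed for illustration)

def affinity(i, threads=THREADS):
    """Owner thread of element i under UPC's default cyclic layout."""
    return i % threads

def local_offset(i, threads=THREADS):
    """Offset of element i inside its owner's local partition."""
    return i // threads

def resolve(i, base_addrs, threads=THREADS):
    """Map a global index to an (owner thread, local address) pair,
    given each thread's partition base address.  This divide/modulo
    plus table lookup runs on every shared access in software; the
    dissertation's ISA extension moves this work into hardware."""
    t = affinity(i, threads)
    return t, base_addrs[t] + local_offset(i, threads)

# With 4 threads, element 5 of a cyclically distributed array
# belongs to thread 1, at offset 1 in that thread's partition.
bases = [0x1000, 0x2000, 0x3000, 0x4000]  # hypothetical partition bases
owner, addr = resolve(5, bases)
```

Performing this translation on every access is what makes naive PGAS code slow relative to plain local loads, and why the compiler must either optimize it away or, as proposed here, delegate it to dedicated hardware.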

Record details

  • Author

    Serres, Olivier.

  • Author's affiliation

    The George Washington University.

  • Degree grantor: The George Washington University.
  • Subject: Computer engineering.
  • Degree: Ph.D.
  • Year: 2016
  • Pages: 122 p.
  • Total pages: 122
  • Format: PDF
  • Language: English
  • Date added: 2022-08-17 11:48:54
